Following the scikit-learn example at http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html#sphx-glr-auto-examples-svm-plot-iris-py, I am trying to plot the decision boundaries of a classifier, but the line "Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])" fails with the error "ValueError: X has 2 features per sample; expecting 908430".

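import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier

# step2: sparse CountVectorizer matrix; index: class labels (both described below)
clf = SGDClassifier().fit(step2, index)
X = step2
y = index
h = .02  # step size of the mesh
colors = "bry"
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# This is the line that raises the ValueError
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])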

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis('off')

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)

'index' holds the labels, an array of about [98579 x 1] entries, one per comment, marking it as positive, neutral, or negative:

array(['N', 'N', 'P', ..., 'NEU', 'P', 'N'], dtype=object)

'step2' is the [98579 x 908430] sparse matrix produced by CountVectorizer from the comment text:

<98579x908430 sparse matrix of type '<type 'numpy.float64'>'
with 3168845 stored elements in Compressed Sparse Row format>
1 Answer

You cannot plot a decision boundary for a classifier trained on data that is not 2-dimensional. Your data is clearly high-dimensional: it has 908430 dimensions (an NLP task, I assume), so there is no way to plot the actual decision boundary of such a model. The error says exactly this: the mesh grid passed to predict has 2 columns, while the classifier was fit on 908430 features. The example you are following is trained on 2D data (the Iris dataset reduced to two features), and that is the only reason it can be plotted.
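
If a picture is still useful, one possible workaround (not something the scikit-learn example does, and not a plot of your original model) is to project the count matrix down to 2 columns and fit a separate classifier on that projection. The sketch below assumes TruncatedSVD as the projection, since it accepts sparse input; the toy corpus, the labels, and the names X_2d and clf_2d are made up for illustration. The string labels are also converted to integers, because a contour plot needs numeric values.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy corpus standing in for the real comments and labels.
docs = [
    "good great excellent", "great good nice", "excellent nice good",
    "bad awful terrible", "terrible bad poor", "awful poor bad",
]
labels = np.array(["P", "P", "P", "N", "N", "N"])

# Sparse count matrix with one column per vocabulary word (more than 2).
X_counts = CountVectorizer().fit_transform(docs)

# Project to 2 columns; TruncatedSVD works on sparse matrices directly.
svd = TruncatedSVD(n_components=2, random_state=0)
X_2d = svd.fit_transform(X_counts)

# Encode string labels as integers so the predictions can be contour-plotted.
classes, y = np.unique(labels, return_inverse=True)

# A NEW classifier fitted on the 2-D projection, not the original model.
clf_2d = SGDClassifier(random_state=0).fit(X_2d, y)

# Same mesh construction as in the question, now over the projected data.
h = 0.02
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# The grid has 2 columns, which now matches what clf_2d was fitted on.
grid = np.c_[xx.ravel(), yy.ravel()]
Z = clf_2d.predict(grid).reshape(xx.shape)
```

From here, plt.contourf(xx, yy, Z, cmap=plt.cm.Paired) and plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap=plt.cm.Paired) draw the picture as in the original example. Keep in mind that two components usually discard most of the information in a bag-of-words matrix, so the plotted boundary can look very different from how the full 908430-feature classifier behaves.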