Plot decision boundaries of classifier, ValueError: X has 2 features per sample; expecting 908430

Question

Based on the scikit-learn example at http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html#sphx-glr-auto-examples-svm-plot-iris-py, I am trying to plot the decision boundaries of a classifier, but I get the error "ValueError: X has 2 features per sample; expecting 908430" on this line: "Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])"

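import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier

# step2 (features) and index (labels) are described below
clf = SGDClassifier().fit(step2, index)
X = step2
y = index
h = .02  # step size of the mesh
colors = "bry"
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# This line raises the ValueError: the grid has 2 columns,
# but clf was fit on 908430 features
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])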

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis('off')

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)

'index' holds the labels, roughly [98579 x 1], one per comment: positive, neutral, or negative

array(['N', 'N', 'P', ..., 'NEU', 'P', 'N'], dtype=object)

'step2' is the [98579 x 908430] sparse matrix produced by CountVectorizer from the comment data

<98579x908430 sparse matrix of type '<type 'numpy.float64'>'
with 3168845 stored elements in Compressed Sparse Row format>

lejlot · Accepted Answer · 2016-10-19T21:09:20

You cannot plot a decision boundary this way for a classifier trained on data that is not 2-dimensional. Your data is clearly high-dimensional: it has 908430 dimensions (an NLP task, I assume). There is no way to plot the actual decision boundary of such a model. The example you are following is trained on 2D data (Iris reduced to two features), and that is the only reason they were able to plot it.
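If a rough picture is still useful, one possible workaround (not something from the linked example, and not a plot of the original model's boundary) is to project the count matrix down to 2 components and fit a separate classifier on that projection. The sketch below assumes TruncatedSVD as the reduction step because it accepts sparse input. The toy `comments` and `labels`, and the names `X_2d`, `clf_2d` and `plot_boundary`, are hypothetical stand-ins for the real data. The string labels are mapped to integers because `contourf` needs numeric values.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy corpus standing in for the real comments and labels.
comments = [
    "great product, love it",
    "love the great service",
    "terrible, broke after a day",
    "awful quality, terrible support",
    "it is okay, nothing special",
    "average product, okay price",
]
labels = np.array(["P", "P", "N", "N", "NEU", "NEU"])

# Sparse count matrix, analogous to step2 in the question.
X_sparse = CountVectorizer().fit_transform(comments)

# Assumption: reduce to 2 components so that a 2D grid is meaningful.
X_2d = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_sparse)

# A *different* model, fit on the 2D projection rather than the full features.
clf_2d = SGDClassifier(random_state=0).fit(X_2d, labels)

h = 0.02  # step size of the mesh
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# The grid now has 2 columns, matching what clf_2d was trained on.
Z = clf_2d.predict(np.c_[xx.ravel(), yy.ravel()])

# Map string labels to integer codes; classes_ is sorted, so searchsorted works.
Z_int = np.searchsorted(clf_2d.classes_, Z).reshape(xx.shape)
y_int = np.searchsorted(clf_2d.classes_, labels)


def plot_boundary():
    """Draw the 2D model's decision regions and the projected training points."""
    import matplotlib.pyplot as plt

    plt.contourf(xx, yy, Z_int, cmap=plt.cm.Paired)
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_int, cmap=plt.cm.Paired)
    plt.axis("off")
    plt.show()
```

Calling `plot_boundary()` draws the figure. Keep in mind that two components usually retain only a small part of the information in a bag-of-words matrix, so the regions shown describe `clf_2d`, not the classifier trained on all 908430 features.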
