I load data in LIBSVM format into a SciPy sparse matrix as shown below. The training set is multi-label and multi-class, as described in an earlier question of mine: Understanding format of data in scikit-learn

from sklearn.datasets import load_svmlight_file
X,Y = load_svmlight_file("train-subset100.csv.csv", multilabel = True, zero_based = True)

Then I use OneVsRestClassifier with LinearSVC to train on the data.

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X, Y)

Now when I want to test the data, I do the following.

X_, Y_ = load_svmlight_file("train-subset10.csv", multilabel = True, zero_based = False)
predicted = clf.predict(X_)

This raises an error. Here is the full traceback, unedited.

Traceback (most recent call last):
  File "test.py", line 36, in
    predicted = clf.predict(X_)
  File "/usr/lib/pymodules/python2.7/sklearn/multiclass.py", line 151, in predict
    return predict_ovr(self.estimators_, self.label_binarizer_, X)
  File "/usr/lib/pymodules/python2.7/sklearn/multiclass.py", line 67, in predict_ovr
    Y = np.array([_predict_binary(e, X) for e in estimators])
  File "/usr/lib/pymodules/python2.7/sklearn/multiclass.py", line 40, in _predict_binary
    return np.ravel(estimator.decision_function(X))
  File "/usr/lib/pymodules/python2.7/sklearn/svm/base.py", line 728, in decision_function
    self._check_n_features(X)
  File "/usr/lib/pymodules/python2.7/sklearn/svm/base.py", line 748, in _check_n_features
    X.shape[1]))
ValueError: X.shape[1] should be 3421, not 690.

I do not understand why it is looking for more features when the input is a sparse matrix. How can I get it to predict the test labels correctly?

1 Answer

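I solved the issue myself. The problem is that load_svmlight_file infers the number of features from the highest feature index that appears in each file, so loading the training and test files separately can produce matrices with different numbers of columns (here 3421 vs. 690). The classifier requires the test matrix to have the same number of features as the training matrix. There are two workarounds. The first is to load all the files at once with load_svmlight_files, which gives every returned matrix the same number of features.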

X, Y, X_, Y_ = load_svmlight_files(["train-subset100.csv", "train-subset10.csv"],
                                   multilabel=True, zero_based=False)

The second is to pass the number of features explicitly when loading the test file.

X, Y = load_svmlight_file("train-subset100.csv", multilabel=True, zero_based=False)
X_, Y_ = load_svmlight_file("train-subset10.csv", n_features=X.shape[1],
                            multilabel=True, zero_based=False)
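
To see the mismatch and both workarounds on something small, here is a minimal sketch. The two tiny files, their contents, and the names train.svm and test.svm are made up for illustration; they stand in for the real training and test subsets. The test file deliberately never uses the highest feature index found in the training file.

```python
import os
import tempfile

from sklearn.datasets import load_svmlight_file, load_svmlight_files

# Hypothetical multilabel data in LIBSVM format (1-based feature indices).
# The training file uses indices up to 5, the test file only up to 3.
train_text = "0,1 1:0.5 4:1.0\n1 2:2.0 5:0.3\n"
test_text = "0 1:0.7 3:1.2\n"

tmp_dir = tempfile.mkdtemp()
train_path = os.path.join(tmp_dir, "train.svm")
test_path = os.path.join(tmp_dir, "test.svm")
with open(train_path, "w") as f:
    f.write(train_text)
with open(test_path, "w") as f:
    f.write(test_text)

# Loaded separately, each file gets its own inferred number of columns,
# so the two matrices do not line up.
X_train, Y_train = load_svmlight_file(train_path, multilabel=True, zero_based=False)
X_bad, _ = load_svmlight_file(test_path, multilabel=True, zero_based=False)
print("separate loads:", X_train.shape, X_bad.shape)

# Workaround 1: load both files in one call so they share a feature count.
Xa, Ya, Xb, Yb = load_svmlight_files([train_path, test_path],
                                     multilabel=True, zero_based=False)
print("joint load:", Xa.shape, Xb.shape)

# Workaround 2: force the test matrix to the training matrix's width.
X_test, Y_test = load_svmlight_file(test_path, n_features=X_train.shape[1],
                                    multilabel=True, zero_based=False)
print("explicit n_features:", X_train.shape, X_test.shape)
```

Passing n_features is probably the more convenient choice when the test file arrives after the model has already been trained, since the training file does not have to be read again.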