
I have instantiated an SVC object using the sklearn library with the following code:

from sklearn import svm

clf = svm.SVC(kernel='linear', C=1, cache_size=1000, max_iter=-1, verbose=True)

I then fit data to it using:

model = clf.fit(X_train, y_train)

where X_train is a (301, 60) ndarray and y_train is a (301,) ndarray (y_train consists of the class labels "1", "2" and "3").

Now, before I stumbled across the .score() method, I was using the following to determine the accuracy of my model on the training set:

import numpy as np

prediction = np.divide((y_train == model.predict(X_train)).sum(), y_train.size, dtype=float)

which gives a result of approximately 62%.

However, when using the model.score(X_train, y_train) method, I get a result of approximately 83%.

Could anyone explain why this is the case? As far as I understand, the two should return the same result.

ADDENDUM:

The first 10 values of y_true are:

  • 2, 3, 1, 3, 2, 3, 2, 2, 3, 1, ...

Whereas for y_pred (when using model.predict(X_train)), they are:

  • 2, 3, 3, 2, 2, 3, 2, 3, 3, 3, ...
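Comparing just these ten values elementwise (a quick check on the subset above only, not the full training set):

import numpy as np

y_true_10 = np.array([2, 3, 1, 3, 2, 3, 2, 2, 3, 1])
y_pred_10 = np.array([2, 3, 3, 2, 2, 3, 2, 3, 3, 3])
print(np.mean(y_true_10 == y_pred_10))  # 0.6 -- 6 of the 10 values agree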
Comment from elyase: That's weird, can you post some subset of your data (at least some y_true and y_pred values)?

1 Answer


Because your y_train is (301, 1) and not (301,), NumPy broadcasts the equality comparison, so

(y_train == model.predict(X_train)).shape == (301, 301)

which is not what you intended: each of your 301 labels is compared against all 301 predictions, not just the matching one.
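A minimal sketch of the same pitfall, using tiny made-up arrays (y_col and y_pred below are stand-ins, not your data):

import numpy as np

y_col = np.array([[1], [2], [3]])   # shape (3, 1), like a column-vector y_train
y_pred = np.array([1, 2, 2])        # shape (3,), like the output of predict()

comparison = y_col == y_pred
print(comparison.shape)  # (3, 3): a full grid of comparisons, not 3 elementwise tests
print(comparison.sum())  # 3: counts matches anywhere in the grid (elementwise it would be 2)

The correct version of your code would be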

np.mean(y_train.ravel() == model.predict(X_train))

which will give the same result as

model.score(X_train, y_train)
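For completeness, here is a self-contained sketch that reproduces the shape mismatch on synthetic data (make_classification is just a stand-in for your dataset) and shows the two accuracies agreeing once the labels are flattened:

import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification

# Synthetic data with the same shapes as in the question.
X, y_flat = make_classification(n_samples=301, n_features=60,
                                n_informative=10, n_classes=3,
                                random_state=0)
y = y_flat.reshape(-1, 1)  # deliberately (301, 1), reproducing the pitfall

model = svm.SVC(kernel='linear', C=1).fit(X, y.ravel())

manual = np.mean(y.ravel() == model.predict(X))
print(manual, model.score(X, y.ravel()))  # the two accuracies match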