5
votes

I'm trying to understand how sklearn cross validation and scoring work, and I'm observing some odd behavior.

I instantiate a classifier, then do 4 fold cross validation on it, getting 4 scores in the range of 90% accuracy +- 0.5%.

I then refit the model on all the training data, and score it on the test data. I'm also scoring it here in this code on the training data, just to prove a point.

I run this code after splitting my data into test and train sets.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import make_scorer, balanced_accuracy_score
from sklearn.model_selection import cross_val_score

gbc = GradientBoostingClassifier()

scores = cross_val_score(gbc, X_train, y_train, cv=4, scoring=make_scorer(balanced_accuracy_score))

print('cv scores: ', scores)
print('cv scores mean: ', scores.mean())

gbc.fit(X_train, y_train)

print('test score on test: ', balanced_accuracy_score(gbc.predict(X_test), y_test))
print('test score on train: ', balanced_accuracy_score(gbc.predict(X_train), y_train))

which prints:

cv scores:  [0.89523728 0.90348769 0.90412818 0.89991599]
cv scores mean:  0.900692282366262
test score on test:  0.8684604909814304
test score on train:  0.874880530883581

I would expect the test score on test output to be in that same range as the cross validated scores, and I would expect the test score on train output to show bad overfitting, and thus an artificially much higher accuracy than the cross validated scores.

Why then do I consistently see those scores come out 3-4% worse than the cross validated scores?

1
Can you provide access to your training and testing data through a link? – Juan Carlos Ramirez
I just tried the code with the iris dataset and a train_test_split ratio of 0.3 and got the opposite result. So everything depends on the dataset you are using and how it was split. It seems that the test subset in your case contains some important information about the relationship between the grouping variable and the feature space that isn't present in the (X_train, y_train) subset. – bubble
Making a generalization based on only one run of the algorithm can be dangerous. You would do better to run your cv, say, 100 times (using a different train split each time) and take the average of the cv scores. Then fit another 100 models on the whole training set (again, different each time) and average the scores obtained on the corresponding test sets. These numbers should be very close, with the second maybe a little higher since the training set is bigger. – Pablo
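
A minimal sketch of the repeated-evaluation idea Pablo describes, assuming X and y are the full dataset before any splitting (those names, the repeat counts, and the 0.25 test size are illustrative, not taken from the question):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score, make_scorer
from sklearn.model_selection import (RepeatedStratifiedKFold, cross_val_score,
                                     train_test_split)

scorer = make_scorer(balanced_accuracy_score)

# Repeat 4-fold CV 25 times with different shuffles (100 fits total) and average.
rkf = RepeatedStratifiedKFold(n_splits=4, n_repeats=25, random_state=0)
cv_scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=rkf, scoring=scorer)

# Repeat the plain fit-then-score-on-a-held-out-test-set procedure 100 times.
holdout_scores = []
for seed in range(100):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              stratify=y, random_state=seed)
    model = GradientBoostingClassifier().fit(X_tr, y_tr)
    holdout_scores.append(balanced_accuracy_score(y_te, model.predict(X_te)))

print('mean repeated-CV score:      ', cv_scores.mean())
print('mean repeated hold-out score:', np.mean(holdout_scores))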

1 Answer

0
votes

This is how cross validation works:

[diagram: k-fold cross-validation, showing the data split into folds with each fold taking a turn as the validation set]

So basically, on each iteration the data is split in a new way and the model is fit and scored against that new split.
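
A rough sketch of what cross_val_score is doing under the hood, assuming X_train and y_train are NumPy arrays as in the question (use .iloc indexing for DataFrames); sklearn actually defaults to stratified folds for classifiers, plain KFold just keeps the sketch simple:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import KFold

for fold, (tr_idx, val_idx) in enumerate(KFold(n_splits=4).split(X_train)):
    model = GradientBoostingClassifier()                 # fresh estimator per fold
    model.fit(X_train[tr_idx], y_train[tr_idx])          # fit on 3 of the 4 folds
    val_score = balanced_accuracy_score(y_train[val_idx],
                                        model.predict(X_train[val_idx]))
    print(f'fold {fold}: validated on {len(val_idx)} samples, score = {val_score:.4f}')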

"I'm trying to understand how sklearn cross validation and scoring work, and I'm observing some odd behavior."

What could matter in your case? The len(X) can be important. Imagine you have 1000 samples: with a normal fit, without cross validation, a 70/30 split trains on 700 samples and tests on 300, whereas cross validation with cv=4 trains on 750 samples and tests on 250 per fold. These different training-set sizes can give different results.
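
A quick way to check the sizes being compared, using a made-up 1000-sample array (the count is illustrative):

import numpy as np
from sklearn.model_selection import KFold, train_test_split

X_fake, y_fake = np.zeros((1000, 5)), np.zeros(1000)

X_tr, X_te, _, _ = train_test_split(X_fake, y_fake, test_size=0.3, random_state=0)
print('70/30 hold-out: fit on', len(X_tr), 'samples, score on', len(X_te))        # 700 / 300

for tr_idx, val_idx in KFold(n_splits=4).split(X_fake):
    print('cv=4 fold:      fit on', len(tr_idx), 'samples, score on', len(val_idx))  # 750 / 250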

What does it mean for your interpretation? That your dataset is quite sensitive to how it is split. Maybe it would be a good idea to collect more data, and I would highly recommend using cross-validation in that case, because otherwise you can get bad prediction results later even though you think you have a good prediction method.