I'm trying to understand how sklearn cross validation and scoring work, and I'm observing some odd behavior.
I instantiate a classifier, then do 4-fold cross validation on it, getting 4 scores in the range of 90% accuracy ± 0.5%.
I then refit the model on all of the training data and score it on the test data. In the code below I also score it on the training data, just to prove a point.
I run this code after splitting my data into train and test sets.
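For reference, the split itself is done with something like the following (the test_size, random_state, and stratify arguments here are illustrative, not the exact call I used):

from sklearn.model_selection import train_test_split

# hold out a test set before doing any cross validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)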
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import make_scorer, balanced_accuracy_score
from sklearn.model_selection import cross_val_score

gbc = GradientBoostingClassifier()

# 4-fold cross validation on the training data only
scores = cross_val_score(gbc, X_train, y_train, cv=4, scoring=make_scorer(balanced_accuracy_score))
print('cv scores: ', scores)
print('cv scores mean: ', scores.mean())

# refit on all of the training data, then score on the held-out test set
# (and on the training data itself, just to prove the point above)
gbc.fit(X_train, y_train)
print('test score on test: ', balanced_accuracy_score(gbc.predict(X_test), y_test))
print('test score on train: ', balanced_accuracy_score(gbc.predict(X_train), y_train))
which prints:
cv scores: [0.89523728 0.90348769 0.90412818 0.89991599]
cv scores mean: 0.900692282366262
test score on test: 0.8684604909814304
test score on train: 0.874880530883581
I would expect the test score on test output to be in the same range as the cross-validated scores, and I would expect the test score on train output to show bad overfitting, i.e. an artificially much higher accuracy than the cross-validated scores.
Why, then, are both of those scores consistently 3-4% worse than the cross-validated scores?
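To make the overfitting comparison concrete, one thing I could do is ask cross validation itself for the training-fold scores. The sketch below uses cross_validate with return_train_score=True; it is illustrative and not part of the run whose output is shown above:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

# train_score shows how the model does on the folds it was fitted on,
# test_score shows how it does on the held-out fold of each split
res = cross_validate(GradientBoostingClassifier(), X_train, y_train,
                     cv=4, scoring='balanced_accuracy',
                     return_train_score=True)
print('train-fold scores: ', res['train_score'])
print('validation-fold scores: ', res['test_score'])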
I tried a train_test_split ratio of 0.3 and got the opposite result. So everything depends on the dataset you are using and how it was split. It seems that the test subset in your case includes some important information about the relationship between the grouping variable and the feature space that isn't present in the (X_train, y_train) subset. – bubble
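A quick way to test that comment's claim is to repeat the whole split / cross-validate / holdout procedure over several random splits and compare the two numbers each time. Everything below (the helper name, the seeds, and test_size) is an illustrative sketch rather than something taken from the thread:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

def compare_cv_to_holdout(X, y, seed):
    # re-split, cross-validate on the training part, then score the
    # refit model on the held-out part, mirroring the code in the question
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    gbc = GradientBoostingClassifier()
    cv_mean = cross_val_score(gbc, X_tr, y_tr, cv=4,
                              scoring='balanced_accuracy').mean()
    gbc.fit(X_tr, y_tr)
    holdout = balanced_accuracy_score(y_te, gbc.predict(X_te))
    return cv_mean, holdout

for seed in range(5):
    print(seed, compare_cv_to_holdout(X, y, seed))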