I'm using sklearn to train a decision tree classifier, but I've run into something odd.
The accuracy returned by the decision tree's score function (0.88) is noticeably higher than what cross_val_score reports (around 0.84).
According to the documentation, score also computes the mean accuracy.
Both are applied to the same test dataset (87,992 samples).
Since cross-validation evaluates on subsets, a slightly different result would make sense, but here the gap seems too large for that.
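For context, here is a small self-contained sketch of what I believe is happening (synthetic data, all names illustrative): cross_val_score does not use the already-fitted tree, it clones the estimator and re-fits each clone on the training folds of whatever data it is given, so the scores come from different trees than the one score() evaluates.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for my data (names are illustrative)
X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# score() evaluates the single tree that was fitted on X_train
print('score:', clf.score(X_test, y_test))

# cross_val_score clones clf and re-fits each clone on 9/10 of X_test,
# so these numbers come from ten different, separately trained trees
scores = cross_val_score(clf, X_test, y_test, cv=10, scoring='accuracy')
print('cv mean:', scores.mean())
```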
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

clf_tree = DecisionTreeClassifier()
clf_tree.fit(X_train, y_train)

# Mean accuracy on the held-out test set
print('Accuracy: %f' % clf_tree.score(X_test, y_test))
# 10-fold cross-validation, run on the test set only
print(cross_val_score(clf_tree, X_test, y_test, cv=10, scoring='accuracy'))
print(classification_report(clf_tree.predict(X_test), y_test))
Output:
Accuracy: 0.881262
[0.84022727 0.83875 0.843164 0.84020911 0.84714172 0.83929992 0.83873167 0.8422548 0.84089101 0.84111831]
              precision    recall  f1-score   support

           0       0.89      0.88      0.88     44426
           1       0.88      0.89      0.88     43566

   micro avg       0.88      0.88      0.88     87992
   macro avg       0.88      0.88      0.88     87992
weighted avg       0.88      0.88      0.88     87992
What is really going on here? Thanks for any advice.