
I'm using sklearn to train a decision tree classifier, but something odd is happening.

The accuracy returned by the decision tree's score function (0.88) is much higher than the one from cross_val_score (around 0.84).

According to the documentation, the score function also computes the mean accuracy.
Both are applied to the same test dataset (87,992 samples).
Since cross-validation evaluates on subsets, a slightly different result would make sense, but here the difference is quite large.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

clf_tree = DecisionTreeClassifier()
clf_tree.fit(X_train, y_train)

print('Accuracy: %f' % clf_tree.score(X_test, y_test))
print(cross_val_score(clf_tree, X_test, y_test, cv=10, scoring='accuracy'))
print(classification_report(clf_tree.predict(X_test), y_test))

Output:

Accuracy: 0.881262

[0.84022727 0.83875    0.843164   0.84020911 0.84714172 0.83929992 0.83873167 0.8422548  0.84089101 0.84111831]

              precision    recall  f1-score   support

           0       0.89      0.88      0.88     44426
           1       0.88      0.89      0.88     43566

   micro avg       0.88      0.88      0.88     87992
   macro avg       0.88      0.88      0.88     87992
weighted avg       0.88      0.88      0.88     87992

What's really going on here? Thanks for any advice.

1 Answer


You have a misunderstanding of what cross_val_score does.

Assuming you have a dataset with 100 rows and split it into a train set (70%) and a test set (30%), you train with 70 rows and test with 30 in the following part of your code:

clf_tree = DecisionTreeClassifier()
clf_tree.fit(X_train, y_train) 
print('Accuracy: %f' % clf_tree.score(X_test, y_test))

Later, on the other hand, you call:

print((cross_val_score(clf_tree, X_test, y_test, cv=10, scoring='accuracy')))

Here cross_val_score takes your 30 rows of test data and splits them into 10 parts. It then uses 9 parts for training and 1 part for testing a completely new, freshly trained classifier. This is repeated until each block has been tested once (10 times in total).

So in the end, your first classifier was trained with 70% of your data, while the 10 classifiers inside cross_val_score were each trained with only 27% of your data.
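That 27% figure follows directly from the fold arithmetic. A quick sketch (the 100-row dataset and 70/30 split are the illustrative numbers assumed above, not from the asker's actual data):

```python
# Illustrative fold arithmetic for a 100-row dataset with a 70/30 split.
n_total = 100
n_train = int(0.70 * n_total)    # 70 rows used for clf_tree.fit
n_test = n_total - n_train       # 30 rows passed to cross_val_score
cv = 10

# Each CV fold trains on (cv - 1) / cv of the test set only:
n_fold_train = n_test * (cv - 1) // cv
print(n_fold_train)              # 27 rows per fold
print(n_fold_train / n_total)    # 0.27 -> 27% of all the data
```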

And in machine learning we often see that more training data gives better results.

To make the point clear: in your code, the following two lines do exactly the same thing:

print((cross_val_score(clf_tree, X_test, y_test, cv=10, scoring='accuracy')))

print((cross_val_score(DecisionTreeClassifier(), X_test, y_test, cv=10, scoring='accuracy')))