0 votes

This is my first time running k-fold cross-validation, and I am confused by something I see in the output. The 5-fold cross-validations consistently give model 8 (AdaBoost Classifier) and model 9 (Gradient Boosting Classifier) the highest accuracy scores, as you can see below. However, when I run those ML models individually, using 20% of the dataset as testing data, model 7 (Random Forest Classifier) always produces the highest accuracy of all 5 models according to the confusion matrix and AUC. My initial expectation was that an ML model with a high k-fold cross-validation accuracy should also return a high accuracy when I run it individually. That doesn't seem to be the case here. Could someone please explain why I am seeing this discrepancy?

These are the ML models I used to train the data:

model6 = DecisionTreeClassifier()
model7 = RandomForestClassifier(n_estimators=300)
model8 = AdaBoostClassifier(n_estimators=300)
model9 = GradientBoostingClassifier(n_estimators=300, learning_rate=1.0, max_depth=1, random_state=0)
model10 = KNeighborsClassifier(n_neighbors=5)

HERE IS MY COMPLETE CODE FOR THE 5-FOLD CV AND THE INDIVIDUAL ML MODELS:

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics
import warnings

X_train, X_test, Y_train, Y_test = train_test_split(whole_data_input, whole_data_output, test_size=0.2)

# Reset the indices after the split; drop=True discards the old index
# instead of keeping it as an extra 'index' column:
X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
Y_train.reset_index(drop=True, inplace=True)
Y_test.reset_index(drop=True, inplace=True)

warnings.filterwarnings('ignore')  # silence sklearn warnings (e.g. about the column-vector y)

model6 = DecisionTreeClassifier()
model7 = RandomForestClassifier(n_estimators=300)
model8 = AdaBoostClassifier(n_estimators=300)
model9 = GradientBoostingClassifier(n_estimators=300,
                                    learning_rate=1.0, max_depth=1, random_state=0)
model10 = KNeighborsClassifier(n_neighbors=5)

model6.fit(X_train, Y_train)
model7.fit(X_train, Y_train)
model8.fit(X_train, Y_train)
model9.fit(X_train, Y_train)
model10.fit(X_train, Y_train)
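# Note: these fits are only used for the hold-out evaluation further down;
# cross_val_score below clones each estimator and refits it on every fold.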

# Perform 5-fold cross-validation across the different models.

# Use whole_data['label'] (a 1-D Series) rather than the whole_data[['label']]
# DataFrame created earlier, because cross_val_score expects y as a 1-D array:
whole_data_output = whole_data['label']

print('THE FOLLOWING OUTPUT REPRESENTS ACCURACIES OF 5-FOLD CROSS-VALIDATION FROM VARIOUS ML MODELS:')
print()
scores = cross_val_score(model6, whole_data_input, whole_data_output, cv=5)
print('Cross-validated scores for model6, Decision Tree Classifier, are: ' + str(scores))

print()
scores = cross_val_score(model7, whole_data_input, whole_data_output, cv=5)
print('Cross-validated scores for model7, Random Forest Classifier, are: ' + str(scores))

print()
scores = cross_val_score(model8, whole_data_input, whole_data_output, cv=5)
print('Cross-validated scores for model8, AdaBoost Classifier, are: ' + str(scores))

print()
scores = cross_val_score(model9, whole_data_input, whole_data_output, cv=5)
print('Cross-validated scores for model9, Gradient Boosting Classifier, are: ' + str(scores))

print()
scores = cross_val_score(model10, whole_data_input, whole_data_output, cv=5)
print('Cross-validated scores for model10, K Neighbors Classifier, are: ' + str(scores))

print('THE FOLLOWING OUTPUT REPRESENTS RESULTS FROM VARIOUS ML MODELS:')
print()

result6 = model6.predict(X_test)
result7 = model7.predict(X_test)
result8 = model8.predict(X_test)
result9 = model9.predict(X_test)
result10 = model10.predict(X_test)

print('Classification report for model 6, decision tree classifier, is: ')
print(confusion_matrix(Y_test,result6))
print()
print(classification_report(Y_test,result6))
print()
print("Area under curve (auc) of model6 is: ", metrics.roc_auc_score(Y_test, result6)) 
print()

print('Classification report for model 7, random forest classifier, is: ')
print(confusion_matrix(Y_test,result7))
print()
print(classification_report(Y_test,result7))
print()
print("Area under curve (auc) of model7 is: ", metrics.roc_auc_score(Y_test, result7)) 
print()

print('Classification report for model 8, adaboost classifier, is: ')
print(confusion_matrix(Y_test,result8))
print()
print(classification_report(Y_test,result8))
print()
print("Area under curve (auc) of model8 is: ", metrics.roc_auc_score(Y_test, result8)) 
print()

print('Classification report for model 9, gradient boosting classifier, is: ')
print(confusion_matrix(Y_test,result9))
print()
print(classification_report(Y_test,result9))
print()
print("Area under curve (auc) of model9 is: ", metrics.roc_auc_score(Y_test, result9)) 
print()

print('Classification report for model 10, K neighbors classifier, is: ')
print(confusion_matrix(Y_test,result10))
print()
print(classification_report(Y_test,result10))
print()
print("Area under curve (auc) of model10 is: ", metrics.roc_auc_score(Y_test, result10)) 
print()

THE FOLLOWING OUTPUT REPRESENTS ACCURACIES OF 5-FOLD CROSS-VALIDATION FROM VARIOUS ML MODELS:

Cross-validated scores for model6, Decision Tree Classifier, are: [ 0.61364665  0.75754735  0.77046902]

Cross-validated scores for model7, Random Forest Classifier, are: [ 0.62463637  0.79326395  0.8073181 ]

Cross-validated scores for model8, AdaBoost Classifier, are: [ 0.64916931  0.81960696  0.84196916]

Cross-validated scores for model9, Gradient Boosting Classifier, are: [ 0.64910466  0.82177258  0.83909235]

Cross-validated scores for model10, K Neighbors Classifier, are: [ 0.61180425  0.75412115  0.73012897]

THE FOLLOWING OUTPUT REPRESENTS RESULTS FROM VARIOUS ML MODELS:

Classification report for model 6, decision tree classifier, is: 
[[6975 1804]
 [1893 7891]]

         precision    recall  f1-score   support

     -1       0.79      0.79      0.79      8779
      1       0.81      0.81      0.81      9784

avg / total       0.80      0.80      0.80     18563

Area under curve (auc) of model6 is:  0.800515237805

Classification report for model 7, random forest classifier, is: 
[[6883 1896]
 [1216 8568]]

         precision    recall  f1-score   support

     -1       0.85      0.78      0.82      8779
      1       0.82      0.88      0.85      9784

avg / total       0.83      0.83      0.83     18563

Area under curve (auc) of model7 is:  0.829872762782

Classification report for model 8, adaboost classifier, is: 
[[5851 2928]
 [ 891 8893]]

         precision    recall  f1-score   support

     -1       0.87      0.67      0.75      8779
      1       0.75      0.91      0.82      9784

avg / total       0.81      0.79      0.79     18563

Area under curve (auc) of model8 is:  0.787704885721

Classification report for model 9, gradient boosting classifier, is: 
[[5905 2874]
 [ 918 8866]]

         precision    recall  f1-score   support

     -1       0.87      0.67      0.76      8779
      1       0.76      0.91      0.82      9784

avg / total       0.81      0.80      0.79     18563

Area under curve (auc) of model9 is:  0.789400603089

Classification report for model 10, K neighbors classifier, is: 
[[6467 2312]
 [1666 8118]]

         precision    recall  f1-score   support

     -1       0.80      0.74      0.76      8779
      1       0.78      0.83      0.80      9784

avg / total       0.79      0.79      0.79     18563

Area under curve (auc) of model10 is:  0.783183129908
Comments:

When you didn't use cross-validation, did you sample your data before splitting it into training and test data? – pythonic833

@pythonic833 Could you explain what you mean by sampling the data before splitting it into training and test data? Were you referring to stratified sampling? From the original dataset, which contains everything, I split the data into 80% training and 20% testing. Was this the right thing to do? – Stanleyrr

Did you sample/rearrange your data before splitting it into training and test data? Sometimes outliers might be at the beginning or at the end of your dataset; sampling should prevent such clusters of outliers. – pythonic833

@Stanleyrr try setting cv=ms.StratifiedKFold(n_splits=5, shuffle=True) in your cross_val_score and see if it makes a difference. My understanding is that train_test_split will sample randomly within the classes but cross_val_score will not (by default). – Stev

No problem, I thought that might be the issue because I have had the same problem before :) I've added my comment as an answer, feel free to accept. – Stev

1 Answer

1 vote

Try setting cv=StratifiedKFold(n_splits=5, shuffle=True) in your cross_val_score and see if it makes a difference. My understanding is that train_test_split will sample randomly within the classes but cross_val_score will not (by default).

You can import StratifiedKFold with: from sklearn.model_selection import StratifiedKFold
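For reference, here is a minimal sketch of that change, assuming whole_data_input and whole_data_output hold the features and the 1-D labels from the question (random_state=0 is only there to make the shuffle reproducible):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratified folds with shuffling: each fold keeps roughly the overall class
# balance, and the original row order of the dataset no longer matters.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

model7 = RandomForestClassifier(n_estimators=300)
scores = cross_val_score(model7, whole_data_input, whole_data_output, cv=cv)
print('Cross-validated scores for model7, Random Forest Classifier, are: ' + str(scores))
print('Mean accuracy: %.3f (+/- %.3f)' % (scores.mean(), scores.std()))

With shuffle=True, each fold should see roughly the same class mix that train_test_split produces, which makes the cross-validated accuracies directly comparable to the hold-out results.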