
I have the code below (using sklearn) that first uses the training set for cross-validation and, as a final check, uses the test set. However, cross-validation consistently performs better, as shown below. Am I over-fitting on the training data? And if so, which hyperparameter(s) would be best to modify to avoid this?

from numpy import mean
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split, RepeatedKFold, cross_validate

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Cross-validation on the training set
rfc = RandomForestClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5)
scoring = {'accuracy', 'precision', 'recall', 'f1', 'roc_auc'}
scores = cross_validate(rfc, X_train, y_train, scoring=scoring, cv=cv)
print(mean(scores['test_accuracy']),
      mean(scores['test_precision']),
      mean(scores['test_recall']),
      mean(scores['test_f1']),
      mean(scores['test_roc_auc'])
      )

which gives me:

0.8536558341101569 0.8641939667622551 0.8392201023654705 0.8514895113569482 0.9264002192260914

# Re-train the model with the entire training+validation set, and test it on the never-before-seen test set
RFC = RandomForestClassifier()

RFC.fit(X_train, y_train)
y_pred = RFC.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
y_pred_proba = RFC.predict_proba(X_test)[:, 1]  # probability of the positive class
auc = roc_auc_score(y_test, y_pred_proba)

print(accuracy,
      precision,
      recall,
      f1,
      auc
      )

This now gives me the numbers below, which are clearly worse:

0.7809788654060067 0.5113236034222446 0.5044687189672294 0.5078730317420644 0.7589037004728368

you might want to use regularization - Sadaf Shafi
Hi, I think it is not right to compare these results, because the cross-validation score is the best score calculated on the training dataset, while the second model's score is calculated on the test dataset. - Trường Thuận Nguyễn
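Regarding the regularization suggestion in the comments: for a RandomForestClassifier, the usual complexity controls are max_depth, min_samples_leaf and max_features. A minimal sketch, with purely illustrative (untuned) values, reusing the scoring and cv objects from the code above:

rfc_reg = RandomForestClassifier(
    n_estimators=200,       # illustrative value, not tuned
    max_depth=6,            # shallower trees
    min_samples_leaf=10,    # larger leaves
    max_features='sqrt',    # fewer candidate features per split
)
scores_reg = cross_validate(rfc_reg, X_train, y_train, scoring=scoring, cv=cv)

Whether this actually closes the gap depends on the data; it mainly trades training-set fit for simpler, more generalizable trees.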

1 Answer


I am able to reproduce your scenario with the Pima Indians Diabetes Dataset.

The difference you see in the prediction metrics is not consistent, and in some runs you may even notice the opposite, because it depends on which rows end up in X_test during the split: some cases are easier to predict and will give better metrics, and vice versa. While cross-validation runs predictions on the whole set in rotation and averages this effect out, a single X_test set will suffer from the effects of the random split.

In order to have better visibility on what is happening here, I have modified your experiment and split it into two steps:

1. Cross-validation step:

I use the whole X and y sets and run the rest of the code as it is:

rfc = RandomForestClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5)
# cv = KFold(n_splits=10)
scoring = {'accuracy', 'precision', 'recall', 'f1', 'roc_auc'}
scores = cross_validate(rfc, X, y, scoring=scoring, cv=cv)
print(mean(scores['test_accuracy']),
      mean(scores['test_precision']),
      mean(scores['test_recall']),
      mean(scores['test_f1']),
      mean(scores['test_roc_auc'])
      )

Output:

0.768257006151743 0.6943032069967433 0.593436328663432 0.6357667086829574 0.8221242747913622

2. Classic train-test step:

Next I run the plain train-test step, but I do it 50 times with different train_test splits and average the metrics (similar to the cross-validation step):

accuracies = []
precisions = []
recalls = []
f1s = []
aucs = []

for i in range(50):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    RFC = RandomForestClassifier()

    RFC.fit(X_train, y_train)
    y_pred = RFC.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    y_pred_proba = RFC.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_pred_proba)
    accuracies.append(accuracy)
    precisions.append(precision)
    recalls.append(recall)
    f1s.append(f1)
    aucs.append(auc)

print(mean(accuracies),
      mean(precisions),
      mean(recalls),
      mean(f1s),
      mean(aucs)
      )

Output:

0.7606926406926405 0.7001931059992001 0.5778712922956755 0.6306501622080503 0.8207846633339568
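To also quantify the split-to-split variability discussed above, one can print the spread of the 50 runs; a small addition, assuming the lists filled by the loop:

from numpy import std

# Spread of the single-split metrics across the 50 random splits; a large
# spread means one particular train/test split can look noticeably better
# or worse than the cross-validated average.
print(std(accuracies), std(precisions), std(recalls), std(f1s), std(aucs))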

As expected, the prediction metrics are similar. However, the cross-validation runs much faster and uses each data point of the whole data set for testing (in rotation) a given number of times.
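To illustrate the "in rotation" point with a toy example (the array below is made up, not the Pima data): KFold puts every sample index into exactly one test fold per pass, which is what lets cross-validation score the whole data set.

import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(10, 2)   # 10 toy samples, 2 features
kf = KFold(n_splits=5)

# Every index 0..9 appears in exactly one test fold, so over the 5 folds
# each sample is scored once; RepeatedKFold simply repeats this rotation
# with a different randomization each time.
for fold, (train_idx, test_idx) in enumerate(kf.split(X_toy)):
    print(f"fold {fold}: test indices {test_idx}")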