I have the code below (using sklearn) that first uses the training set for cross-validation and, as a final check, evaluates on the test set. However, the cross-validation scores are consistently better, as shown below. Am I over-fitting on the training data? And if so, which hyperparameter(s) would be best to modify to avoid this?
from numpy import mean  # assuming numpy's mean here; statistics.mean would also work
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RepeatedKFold, cross_validate

# Hold out 30% of the data as a final test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Cross-validation on the training set
rfc = RandomForestClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5)
scoring = {'accuracy', 'precision', 'recall', 'f1', 'roc_auc'}
scores = cross_validate(rfc, X_train, y_train, scoring=scoring, cv=cv)
print(mean(scores['test_accuracy']),
      mean(scores['test_precision']),
      mean(scores['test_recall']),
      mean(scores['test_f1']),
      mean(scores['test_roc_auc'])
      )
which gives me:
0.8536558341101569 0.8641939667622551 0.8392201023654705 0.8514895113569482 0.9264002192260914
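For context, here is a minimal sketch of how the spread of those scores across folds could be inspected; it only assumes the scores dict above and numpy:

import numpy as np

# Mean and standard deviation of each metric across all folds/repeats,
# to see how much the cross-validation estimate itself varies
for metric in ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']:
    fold_scores = scores['test_' + metric]
    print(metric, np.mean(fold_scores), np.std(fold_scores))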
# Re-train the model on the full training set (which served as training+validation
# during cross-validation) and evaluate it on the never-seen-before test set
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

RFC = RandomForestClassifier()
RFC.fit(X_train, y_train)
y_pred = RFC.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Use the predicted probability of the positive class for ROC AUC
y_pred_proba = RFC.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)

print(accuracy,
      precision,
      recall,
      f1,
      auc
      )
This now gives me the numbers below, which are clearly worse:
0.7809788654060067 0.5113236034222446 0.5044687189672294 0.5078730317420644 0.7589037004728368
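To check whether this gap really is over-fitting, a quick comparison of training-set versus test-set accuracy on the fitted model could look like the sketch below; it reuses RFC, X_train and X_test from above, nothing else is assumed:

# Accuracy on the data the forest was fitted on vs. the held-out test set;
# training accuracy near 1.0 combined with a much lower test accuracy
# is the usual sign that the forest has memorised the training data
train_accuracy = RFC.score(X_train, y_train)
test_accuracy = RFC.score(X_test, y_test)
print('train accuracy:', train_accuracy)
print('test accuracy:', test_accuracy)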
The first model's scores are calculated on the training dataset, while the second model's scores are calculated on the testing dataset. - Trường Thuận Nguyễn
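Following up on that comment, the kind of change I had in mind is constraining the individual trees and tuning over a small grid with GridSearchCV; the parameter values below are just illustrative assumptions, not recommendations from any particular source:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hyperparameters that limit tree complexity and therefore tend to reduce
# over-fitting: shallower trees (max_depth), larger leaves (min_samples_leaf)
# and fewer candidate features per split (max_features)
param_grid = {
    'max_depth': [5, 10, 20, None],
    'min_samples_leaf': [1, 5, 10],
    'max_features': ['sqrt', 'log2'],
}

grid = GridSearchCV(RandomForestClassifier(), param_grid,
                    scoring='roc_auc', cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)            # cross-validated score of the best setting
print(grid.score(X_test, y_test))  # final check on the held-out test set

My understanding is that max_depth and min_samples_leaf usually have the largest effect on the train/test gap for random forests, so those are the ones I would restrict first.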