0
votes

Usually we split the original feature and target data (X, y) into (X_train, y_train) and (X_test, y_test).

By using the method:

mae_A = cross_val_score(clf, X_train_scaled, y_train, scoring="neg_mean_absolute_error", cv=kfold)

I get the cross validation Mean Absolute Error (MAE) for the (X_train, y_train), right?

So, how can I get the MAE for (X_test, y_test) using the models obtained from cross-validation on (X_train, y_train)?

Thank you very much!

Maicon P. Lourenço

2
Usually, you don't do cross-validation for train and test separately. You do it on the whole data set. – DollarAkshay
If instead of kfold you pass cv an iterable yielding (train, test) splits as arrays of indices, your model will train on the train indices and produce a score for the test indices. – Sergey Bushmanov

2 Answers

2
votes

This is the correct approach. As a rule, you should only train your model using training data. Thus the test set should remain unseen during the cross-validation process, i.e. it should not influence the model's hyperparameters; otherwise you could bias the results by leaking knowledge from the test sample into the model.

I get the cross validation Mean Absolute Error (MAE) for the (X_train, y_train), right?

Yes, the error reported by cross_val_score comes only from the training data. The idea is that once you are satisfied with the results of cross_val_score, you fit the final model on the whole training set and make predictions on the test set. To evaluate those predictions you can use sklearn.metrics. For instance, to obtain the MAE:

from sklearn.metrics import mean_absolute_error as mae
test_mae = mae(y_test, y_pred)
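Putting the whole workflow together, a minimal sketch might look like the following (the synthetic data from make_regression and the Ridge regressor are placeholders for your own data and estimator; the scores returned for "neg_mean_absolute_error" are negated, so flip the sign to read them as MAE):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Hypothetical data standing in for your (X, y)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = Ridge()
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Cross-validated MAE, computed on the training data only
cv_mae = -cross_val_score(clf, X_train, y_train,
                          scoring="neg_mean_absolute_error", cv=kfold)
print("CV MAE (train):", cv_mae.mean())

# Final model: fit once on the full training set, then score the held-out test set
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Test MAE:", mean_absolute_error(y_test, y_pred))
```

The test-set MAE is computed exactly once, after all model choices have been made, so it stays an unbiased estimate of generalization error.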
0
votes

Try this:

(assuming you have data x and y; cross_val_score fits the model internally on each fold, so you don't need to call fit(x, y) yourself)

from sklearn import linear_model
from sklearn.model_selection import cross_val_score
reg = linear_model.LinearRegression()
scoring = 'neg_mean_absolute_error'
mae = cross_val_score(reg, x, y, cv=5, scoring=scoring)
mae
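A runnable version of this sketch, with synthetic data from make_regression standing in for your x and y. Note that "neg_mean_absolute_error" returns negated errors (so that higher is better), so negate the scores to read them as MAE:

```python
from sklearn import linear_model
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

# Placeholder data for the x, y in the answer above
x, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=42)

reg = linear_model.LinearRegression()
scores = cross_val_score(reg, x, y, cv=5, scoring="neg_mean_absolute_error")
print("MAE per fold:", -scores)   # five non-negative values
print("Mean MAE:", -scores.mean())
```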