
When I train an SVC with cross-validation,

y_pred = cross_val_predict(svc, X, y, cv=5, method='predict')

cross_val_predict returns one class prediction for each element in X, so that y_pred.shape = (1000,) when m = 1000. This makes sense, since cv=5 and therefore the SVC was trained and validated 5 times on different parts of X. In each of the five validations, predictions were made for one fifth of the instances (m/5 = 200). Subsequently, the 5 vectors of 200 predictions each were merged into y_pred.

With all of this in mind, it would seem reasonable to calculate the overall accuracy of the SVC using y_pred and y.

score = accuracy_score(y, y_pred)
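
For completeness, here is a minimal, self-contained version of what I am doing (the dataset below is only a stand-in for my actual data, and the SVC parameters are just defaults):

# Minimal sketch of the setup described above.
# The dataset is a placeholder generated with make_classification.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)  # m = 1000
svc = SVC(kernel='rbf')

# one out-of-fold prediction per instance -> y_pred.shape == (1000,)
y_pred = cross_val_predict(svc, X, y, cv=5, method='predict')
score = accuracy_score(y, y_pred)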

But (!) the documentation of cross_val_predict states:

The result of cross_val_predict may be different from those obtained using cross_val_score as the elements are grouped in different ways. The function cross_val_score takes an average over cross-validation folds, whereas cross_val_predict simply returns the labels (or probabilities) from several distinct models undistinguished. Thus, cross_val_predict is not an appropriate measure of generalisation error.

Could someone please explain, in other words, why cross_val_predict is not appropriate for measuring the generalisation error, e.g. via accuracy_score(y, y_pred)?


Edit:

I first assumed that with cv=5, predictions would be made for all instances of X in each of the 5 validations. But this is wrong: in each validation, predictions are only made for 1/5 of the instances of X.


1 Answer


cross_val_score vs cross_val_predict

The differences between cross_val_predict and cross_val_score are described really clearly here, and that answer contains another link, so you can follow the rabbit hole.

In essence:

  • cross_val_score returns one score per fold
  • cross_val_predict makes out-of-fold predictions for each data point (see the short sketch right after this list)
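
A quick sketch of that difference; the dataset here is just a toy one so the snippet runs on its own:

# cross_val_score: one score per fold; cross_val_predict: one prediction per point
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)
svc = SVC()

scores = cross_val_score(svc, X, y, cv=5)    # shape (5,): one accuracy per fold
y_pred = cross_val_predict(svc, X, y, cv=5)  # shape (1000,): one out-of-fold prediction per point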

Now, cross_val_predict does not tell you which prediction came from which fold, so you cannot calculate a per-fold average the way cross_val_score does. You could compare the mean of the cross_val_score results with the accuracy_score of the cross_val_predict output, but an average of per-fold averages is not the same as one pooled average (they only coincide when all folds have the same size), hence the results would be different.

If one fold has a very low accuracy, it pulls down the fold-averaged cross_val_score result more than it would pull down the single accuracy computed over all of cross_val_predict's predictions (each fold counts equally in the average, regardless of its size).

Furthermore, you could group the data points differently (see the seven-point example below) and get different results. That's why the documentation points out that the elements being grouped in different ways makes the difference.

Example of the difference between cross_val_score and cross_val_predict

Let's imagine cross_val_predict uses 3 folds for 7 data points and the out-of-fold predictions are [0,1,1,0,1,0,1], while the true targets are [0,1,1,0,1,1,0]. The accuracy score would be calculated as 5/7 (only the last two points were predicted incorrectly).

Now take those same predictions and split them into the following 3 folds:

  • [0, 1, 1] predicted vs. [0, 1, 1] true -> accuracy of 1 for the first fold
  • [0, 1] predicted vs. [0, 1] true -> perfect accuracy again
  • [0, 1] predicted vs. [1, 0] true -> accuracy of 0

This is what cross_val_score does, and it would return an array of per-fold accuracies, namely [1, 1, 0]. Now, you can average this array, and the total accuracy is 2/3.

See? With the same data, you would get two different measures of accuracy (one being 5/7 and the other 2/3).
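
You can reproduce both numbers with accuracy_score; the arrays below are just the toy predictions and targets from the example above:

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 1, 0, 1, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0, 1])  # out-of-fold predictions

# what accuracy_score(y, cross_val_predict(...)) gives: one pooled score
print(accuracy_score(y_true, y_pred))      # 5/7 ≈ 0.714

# what cross_val_score effectively does: score each fold, then average
folds = [slice(0, 3), slice(3, 5), slice(5, 7)]
fold_scores = [accuracy_score(y_true[f], y_pred[f]) for f in folds]
print(fold_scores)                         # [1.0, 1.0, 0.0]
print(np.mean(fold_scores))                # 2/3 ≈ 0.667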

In both cases, the grouping changed the total accuracy you would obtain. Classifier errors weigh more heavily with cross_val_score, as each error influences its fold's accuracy more than it would influence the pooled accuracy over all predictions (you can check this on your own).

Both could be used for evaluating your model's performance on the validation data, though, and I see no contraindication, just different behavior (with cross_val_predict, errors concentrated in one fold are not weighted as heavily).

Why neither is a measure of generalization

If you tune your algorithm based on its cross-validation scores, you are introducing data leakage (fine-tuning it to the training and validation data). In order to get a sense of the generalisation error, you would have to leave a part of your data out of both cross-validation and training.

You may want to perform nested (double) cross-validation, or simply leave a test set out, to see how well your model actually generalizes.
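
A rough sketch of the hold-out variant; the dataset, split size and classifier are arbitrary placeholders:

# Keep a test set completely outside of cross-validation and tuning,
# and only touch it once at the very end.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

svc = SVC()
cv_scores = cross_val_score(svc, X_train, y_train, cv=5)  # used for model selection / tuning only

# after all tuning is done: fit on the training data, score once on the held-out test set
svc.fit(X_train, y_train)
test_score = svc.score(X_test, y_test)  # estimate of how well the model generalizes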