It comes down to how the samples are grouped when the score is computed. From the user guide linked in the cross_val_predict docs:
Warning: Note on inappropriate usage of cross_val_predict

The result of cross_val_predict may be different from those obtained using cross_val_score as the elements are grouped in different ways. The function cross_val_score takes an average over cross-validation folds, whereas cross_val_predict simply returns the labels (or probabilities) from several distinct models undistinguished. Thus, cross_val_predict is not an appropriate measure of generalisation error.
In other words, cross_val_score computes one score per fold and then averages those scores, while cross_val_predict stitches together the out-of-fold predictions from several distinct models into a single prediction vector; a score computed on that pooled vector mixes models together, so it is not a proper measure of generalization error. For example, using the sample code from the sklearn page:
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.metrics import mean_squared_error, make_scorer

diabetes = datasets.load_diabetes()
X = diabetes.data[:200]
y = diabetes.target[:200]
lasso = linear_model.Lasso()

# One MSE computed on the pooled out-of-fold predictions
y_pred = cross_val_predict(lasso, X, y, cv=3)
print("Cross Val Prediction score:{}".format(mean_squared_error(y, y_pred)))

# Average of the three per-fold MSEs
print("Cross Val Score:{}".format(
    np.mean(cross_val_score(lasso, X, y, cv=3, scoring=make_scorer(mean_squared_error)))))
Cross Val Prediction score:3993.771257795029
Cross Val Score:3997.1789145156217
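To make the grouping difference concrete, here is a minimal sketch that reproduces both numbers by hand, assuming the default unshuffled KFold split that both functions use for cv=3 on a regressor: one path averages a per-fold MSE, the other pools all out-of-fold predictions and scores them once.

import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

diabetes = datasets.load_diabetes()
X = diabetes.data[:200]
y = diabetes.target[:200]
lasso = linear_model.Lasso()

fold_mses = []
y_pooled = np.empty_like(y, dtype=float)
for train_idx, test_idx in KFold(n_splits=3).split(X):
    lasso.fit(X[train_idx], y[train_idx])
    preds = lasso.predict(X[test_idx])
    fold_mses.append(mean_squared_error(y[test_idx], preds))  # one score per fold
    y_pooled[test_idx] = preds                                # stitch predictions together

print("Mean of per-fold MSEs: {}".format(np.mean(fold_mses)))                    # ~ cross_val_score
print("MSE of pooled predictions: {}".format(mean_squared_error(y, y_pooled)))   # ~ cross_val_predict

The gap between the two numbers is a weighting effect: averaging per-fold MSEs gives each fold equal weight, while the pooled MSE gives each sample equal weight. With 200 samples and cv=3 the folds have sizes 67, 67, and 66, so the two quantities differ slightly; with exactly equal fold sizes they would coincide for MSE, but for non-decomposable metrics such as R² the discrepancy can be much larger, which is why the user guide warns against treating a score on cross_val_predict output as a generalisation estimate.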