How do I use a sklearn model evaluated with cross_val_predict to predict on another scaled data set, and can the model be serialized?

Question

I ran into this question while working on a sklearn ML case with heavily imbalanced data. The line below provides the basis for assessing the model from the confusion-matrix and precision-recall perspectives, but it is a combined train/predict method:

y_pred = model_selection.cross_val_predict(model, X, Y, cv=kfold)

The question is how do I leverage this 'cross-val-trained' model to:

1) predict on another data set (scaled) instead of having to train/predict each time?

2) export/serialize/deploy the model to predict on live data?

model.predict() #--> nope.  need a fit() first

model.fit() #--> nope.  a different model which does not take advantage of the cross_val_xxx methods

Any help is appreciated.

Matthieu Brucher · Accepted Answer · 2018-12-02T16:49:39

You can fit a new model with the data.

The cross validation aspect is about validating the way the model is built, not the model itself. So if the cross validation is OK, then you can train a new model with all the data.
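A minimal sketch of that two-step workflow might look like the following. The synthetic data, the choice of LogisticRegression, and wrapping StandardScaler in a pipeline are all assumptions made for illustration and are not part of the original question. The point is only that cross_val_predict works on clones and leaves the estimator unfitted, so a separate fit() on all the data is needed before predict().

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the real X, Y (assumption).
X, Y = make_classification(n_samples=500, n_features=8,
                           weights=[0.9, 0.1], random_state=0)
# Stand-in for "another data set" to predict on later (assumption).
X_new, _ = make_classification(n_samples=50, n_features=8,
                               weights=[0.9, 0.1], random_state=1)

# Keeping the scaler inside the pipeline means it is re-fitted on each
# training fold and applied consistently to any data passed to predict().
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Step 1: validate the way the model is built.
# cross_val_predict fits clones of `model`; `model` itself stays unfitted.
y_pred = cross_val_predict(model, X, Y, cv=kfold)
cm = confusion_matrix(Y, y_pred)

# Step 2: if the cross-validated results look acceptable,
# train the final model on all the data and use it for new predictions.
model.fit(X, Y)
new_pred = model.predict(X_new)
```

With the scaler inside the pipeline, raw (unscaled) new data can be passed straight to predict(). If the new data has already been scaled separately, the same fitted scaler would have to be used for it.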

(See also my answer to "Fitting sklearn GridSearchCV model" for more details.)
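On the serialization part of the question: a fitted estimator can usually be persisted with joblib, which is installed alongside scikit-learn. The sketch below assumes a small pipeline fitted on synthetic data and a hypothetical file name, model.joblib, in a temporary directory. It is an illustration, not a deployment recipe.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and model choice are assumptions for illustration.
X, Y = make_classification(n_samples=200, n_features=8, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, Y)  # the final model, trained on all the data

# Hypothetical location for the serialized model.
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path)

# Later, e.g. in the process that serves live data:
loaded = joblib.load(path)
same = np.array_equal(loaded.predict(X), model.predict(X))
```

The file should be loaded with the same scikit-learn version that wrote it, and only from a trusted source, since loading executes pickled code.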
