From train test split to cross validation in sklearn using pipeline

Question

I have the following piece of code:

from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from sklearn.pipeline import Pipeline
...
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    dataframe[features_], dataframe[labels], test_size=0.30, random_state=42, shuffle=True
)
classifier = RandomForestClassifier(n_estimators=11)
pipe = Pipeline([('feats', feature), ('clf', classifier)])
pipe.fit(x_train, y_train)
predicts = pipe.predict(x_test)

Instead of a train/test split, I want to use k-fold cross-validation to train my model. However, I do not know how to do this with the pipeline structure. I came across https://scikit-learn.org/stable/modules/compose.html, but I could not adapt it to my code.

If possible, I want to use StratifiedKFold from sklearn.model_selection. I can use it without the pipeline structure, but I cannot get it to work with a pipeline.

Update: I tried this, but it raises an error.

from sklearn.model_selection import StratifiedKFold, cross_val_predict

x_train = dataframe[features_]
y_train = dataframe[labels]

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
classifier = RandomForestClassifier(n_estimators=11)

# pipe = Pipeline([('feats', feature), ('clf', classifier)])
# pipe.fit(x_train, y_train)
# predicts = pipe.predict(x_test)

# Out-of-fold predictions for the bare classifier (no pipeline yet)
predicts = cross_val_predict(classifier, x_train, y_train, cv=skf)

Antoine Dubuis · Accepted Answer · 2021-06-13T08:54:38

Pipeline is used to assemble several steps such as preprocessing, transformations, and modeling. StratifiedKFold is used to split your dataset in order to assess the performance of your model. It is not meant to be part of the Pipeline, because you do not want to apply it to new data.
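To illustrate why the splitter is not a pipeline step, here is a minimal sketch of what StratifiedKFold actually produces: index arrays, not transformed data. The toy labels and the all-zero feature matrix are assumptions made up for this example and are not from the question.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (assumption): six samples of class 0, three of class 1.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
X = np.zeros((len(y), 2))  # feature values do not influence the split

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
folds = list(skf.split(X, y))

for train_idx, test_idx in folds:
    # Each held-out fold keeps the 2:1 class ratio of the full label vector.
    print("test indices:", test_idx, "labels:", y[test_idx])
```

The splitter only decides which rows go where; it has no fit/transform/predict behavior, which is why it lives next to the pipeline rather than inside it.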

Therefore, it is normal to perform the split outside of the pipeline structure.
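In practice this means the whole pipeline is passed as the estimator to a cross-validation helper, with the StratifiedKFold object supplied as cv. The following is a sketch under assumptions: the `feature` step from the question is not shown, so StandardScaler is used as a hypothetical placeholder, and make_classification stands in for dataframe[features_] and dataframe[labels].

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for dataframe[features_] and dataframe[labels] (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# StandardScaler is a hypothetical placeholder for the question's `feature` step.
pipe = Pipeline([
    ("feats", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=11, random_state=42)),
])

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# The pipeline is cloned and refit on each training fold, so the preprocessing
# step never sees the held-out fold.
predicts = cross_val_predict(pipe, X, y, cv=skf)

# Per-fold accuracy, one score per split.
scores = cross_val_score(pipe, X, y, cv=skf, scoring="accuracy")

print(predicts.shape)
print(scores)
```

With the original data, the same calls should work by swapping X and y for dataframe[features_] and dataframe[labels], and the placeholder step for the real `feature` transformer.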
