Using custom Pipeline for Cross Validation scikit-learn

Question

I would like to use GridSearchCV to determine the parameters of a classifier, and using pipelines seems like a good option.

The application will be for image classification using Bag-of-Words features, but the issue is that there is a different logical pipeline depending on whether training or test examples are used.

For each training set, KMeans must run to produce a vocabulary that is then reused at test time; no KMeans fitting is run on the test data.

I cannot see how it is possible to specify this difference in behavior for a pipeline.

ogrisel · Accepted Answer · 2012-10-24T20:53:42

You probably need to derive from the KMeans class and override the following methods to use your vocabulary logic:

fit_transform, which will only be called on the training data
transform, which will be called on the test data
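
To see which method a Pipeline calls on which data, here is a minimal sketch. RecordingTransformer and the random data are hypothetical, made up for illustration; the transformer passes its input through unchanged and only logs the method that ran.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

calls = []  # module-level log of which transformer methods were invoked


class RecordingTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical pass-through transformer that logs which method ran."""

    def fit(self, X, y=None):
        calls.append("fit")
        return self

    def transform(self, X):
        calls.append("transform")
        return X

    def fit_transform(self, X, y=None):
        calls.append("fit_transform")
        return X


rng = np.random.RandomState(0)
X_train, y_train = rng.rand(20, 3), np.arange(20) % 2
X_test = rng.rand(5, 3)

pipe = Pipeline([("record", RecordingTransformer()),
                 ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)  # training data goes through fit_transform
pipe.predict(X_test)        # test data goes through transform only
print(calls)
```

The vocabulary-building step therefore belongs in fit / fit_transform, and the lookup against the already learned vocabulary belongs in transform.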

Maybe class derivation is not always the best option. You can also write your own transformer class that wraps calls to an embedded KMeans model and provides the fit / fit_transform / transform API that is expected by the Pipeline class for the first stages.
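
A minimal sketch of such a wrapper, under some assumptions: each image is represented as a 2-D array of local descriptors, BagOfWordsTransformer and its n_words parameter are hypothetical names, and the data is synthetic. fit learns the KMeans vocabulary; transform only assigns descriptors to the learned words and counts them.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


class BagOfWordsTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: learns a KMeans vocabulary, emits word histograms."""

    def __init__(self, n_words=8, random_state=0):
        self.n_words = n_words
        self.random_state = random_state

    def fit(self, X, y=None):
        # X is a sequence of (n_descriptors_i, n_dims) arrays, one per image.
        self.kmeans_ = KMeans(n_clusters=self.n_words, n_init=10,
                              random_state=self.random_state)
        self.kmeans_.fit(np.vstack(X))  # vocabulary built from this data only
        return self

    def transform(self, X):
        # No clustering here: map descriptors to learned words and count them.
        hists = [np.bincount(self.kmeans_.predict(d), minlength=self.n_words)
                 for d in X]
        return np.asarray(hists, dtype=float)


# Synthetic stand-in for image descriptors (assumption, not real data).
rng = np.random.RandomState(0)
labels = np.array([0, 1] * 6)
images = [rng.rand(30, 4) + 2.0 * label for label in labels]

pipe = Pipeline([("bow", BagOfWordsTransformer()),
                 ("clf", SVC(kernel="linear"))])
search = GridSearchCV(pipe, {"bow__n_words": [4, 8]}, cv=3)
search.fit(images, labels)
print(search.best_params_)
```

Because fit_transform is inherited from TransformerMixin (fit followed by transform), GridSearchCV refits the vocabulary on the training part of every fold, and the held-out part only goes through transform.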
