In the scikit-learn documentation on Pipeline, all the examples apply the transformers (e.g. StandardScaler, PCA) to the entire dataset.

Is it possible to scale, say, only one specific variable in the dataset? If so, I could put my entire feature engineering process into a Pipeline and apply it to both my train and test sets.

1 Answer

You can combine FeatureUnion with custom transformers that each select only the variable you're interested in.
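
A minimal sketch of that idea, assuming the data is a pandas DataFrame. The `ColumnSelector` class, the column names `age` and `flag`, and the toy values are all made up for illustration; they are not part of scikit-learn. One branch of the FeatureUnion selects `age` and scales it, the other passes `flag` through untouched.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: keep only the named DataFrame columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; selection is stateless.
        return self

    def transform(self, X):
        # Return a 2-D NumPy array so downstream steps get plain arrays.
        return X[self.columns].to_numpy()


# Toy data, invented for this example.
train = pd.DataFrame({"age": [20.0, 30.0, 40.0], "flag": [0, 1, 0]})
test = pd.DataFrame({"age": [30.0, 50.0], "flag": [1, 1]})

features = FeatureUnion([
    # Branch 1: select "age", then standardize it.
    ("scaled_age", Pipeline([
        ("select", ColumnSelector(["age"])),
        ("scale", StandardScaler()),
    ])),
    # Branch 2: pass "flag" through unchanged.
    ("raw_flag", ColumnSelector(["flag"])),
])

# Fit on the training set only, then reuse the fitted scaler on the test set.
train_out = features.fit_transform(train)
test_out = features.transform(test)
```

Because the scaler is fitted inside the union, calling `features.transform(test)` reuses the mean and standard deviation learned from the training set, which is what you want when applying the same feature engineering to both sets.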

However, you're right that scikit-learn does not handle heterogeneous feature sets particularly well. The sklearn-pandas library makes this a lot easier: it lets you define separate pipelines for specific columns of a pandas DataFrame.