sklearn pipelines with fit_transform or predict objects instead of fit objects

Question

This example on the sklearn website and this answer about sklearn pipelines on SO only use and discuss the .fit() or .fit_transform() methods in Pipelines.

But how do I use the .predict or .transform methods in Pipelines? Let's say I have pre-processed my train data, searched for the best hyper-parameters, and trained a LightGBM model. I would now like to predict on new data. Instead of doing all of the aforementioned steps manually, I want to run them one after another, according to the definition:

Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit.

But I only want to call the .transform methods on my validation (or test) data, plus some more functions (or classes) that take a pandas Series (or DataFrame, or numpy array) and return a processed one, and then finally call the .predict method of my LightGBM model, which would use the hyper-parameters I already have.

I currently have nothing, since I don't know how to properly include methods of classes (like StandardScaler_instance.transform()) and other such methods.

How do I do this or what have I missed?

Kim Tang · Accepted Answer · 2020-09-17T08:47:50

You have to build a pipeline that includes the LightGBM model, and then train that pipeline on your (pre-processed) train data.

In code, it could look like this:

import lightgbm
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Create some train and test data
X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Define pipeline with scaler and lightgbm model
pipe = Pipeline([('scaler', StandardScaler()), ('lightgbm', lightgbm.LGBMClassifier())])

# Train pipeline
pipe.fit(X_train, y_train)

# Make predictions with pipeline (with lightgbm)
print("Predictions:", pipe.predict(X_test))

# Evaluate pipeline performance
print("Performance score:", pipe.score(X_test, y_test))

Output:

Predictions: [1 0 1 0 0 0 1 0 1 1 1 0 0 1 0 1 0 0 1 1 1 0 1 0 0]
Performance score: 0.84

So to answer your questions:

But how do I use the .predict or .transform methods in Pipelines?

You don't have to call .transform yourself. When you call pipe.predict, the pipeline applies the .transform method of every intermediate step to your input data and passes the result to the .predict method of the final estimator. That's why the documentation mentions:

Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods.

You can use .predict as shown in the code example with your test data.
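To make the point above concrete, here is a minimal sketch that compares pipe.predict with the same steps done by hand. It assumes scikit-learn's LogisticRegression as a stand-in for the LightGBM model (only to keep the example free of extra dependencies), and the step names "scaler" and "model" are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same toy data as in the example above
X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LogisticRegression is an assumption here, standing in for lightgbm.LGBMClassifier
pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_train, y_train)

# One call: the pipeline transforms X_test with the fitted scaler, then predicts
pipeline_predictions = pipe.predict(X_test)

# The same thing done manually with the fitted steps
scaler = pipe.named_steps["scaler"]
model = pipe.named_steps["model"]
manual_predictions = model.predict(scaler.transform(X_test))

same_predictions = bool(np.array_equal(pipeline_predictions, manual_predictions))
print("Pipeline and manual predictions match:", same_predictions)
```

Note that predict never refits anything: the scaler reuses the mean and variance it learned during pipe.fit on the train data.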

Instead of the StandardScaler I used in this example, you can provide the pipeline with your own custom transformer. It has to implement .fit() and .transform() methods that the pipeline can call, and its output needs to match the input the lightgbm model expects.
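As a rough sketch of what such custom steps could look like: the class ClipOutliers and the function add_row_mean below are hypothetical names invented for this illustration, and LogisticRegression again stands in for the LightGBM model. A class with fit and transform covers steps that learn something from the train data, while scikit-learn's FunctionTransformer wraps a plain function that needs no fitting.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


class ClipOutliers(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clips each column to percentiles learned in fit."""

    def __init__(self, lower=1.0, upper=99.0):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X)
        # Learn per-column bounds from the training data only
        self.low_ = np.percentile(X, self.lower, axis=0)
        self.high_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        # Reuse the bounds learned in fit on any new data
        return np.clip(np.asarray(X), self.low_, self.high_)


def add_row_mean(X):
    """Hypothetical stateless step: append the row mean as an extra column."""
    X = np.asarray(X)
    return np.hstack([X, X.mean(axis=1, keepdims=True)])


X, y = make_classification(random_state=0)

pipe = Pipeline([
    ("clip", ClipOutliers()),
    ("row_mean", FunctionTransformer(add_row_mean)),
    ("model", LogisticRegression(max_iter=1000)),  # stand-in for LGBMClassifier
])
pipe.fit(X, y)

# pipe[:-1] is the pipeline without its final estimator, i.e. preprocessing only
transformed = pipe[:-1].transform(X)
predictions = pipe.predict(X)
print(transformed.shape, predictions.shape)
```

Both np.asarray calls accept a pandas DataFrame as well as a numpy array, so the same steps should work for either input type, although the output is always a numpy array in this sketch.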

Update

You can then provide arguments for different steps of the pipeline as explained in the documentation here:

**fit_params : dict of string -> object
Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
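A small sketch of the s__p naming convention, under the assumption that the final step is called "model" and is a LogisticRegression standing in for the LightGBM model. The same prefix works with set_params, which is one way to plug in hyper-parameters you have already found; the value C=0.5 and the weights below are made up for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])

# Hyper-parameters: step name, double underscore, parameter name
pipe.set_params(model__C=0.5)

# Fit parameters use the same prefix; here the sample weights are
# forwarded to LogisticRegression.fit only (illustrative weights)
sample_weight = np.where(y == 1, 2.0, 1.0)
pipe.fit(X, y, model__sample_weight=sample_weight)

configured_C = pipe.named_steps["model"].C
predictions = pipe.predict(X)
print("C used by the model step:", configured_C)
```

With the pipeline from the answer, the prefix would be lightgbm__ instead of model__, since that is the name given to the LightGBM step there.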
