1
votes

I've been working on this classification problem using sklearn's Pipeline to combine the preprocessing step (scaling) and the cross validation step (GridSearchCV) using Logistic Regression.

Here is the simplified code:

# import dependencies
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler, RobustScaler   

# scaler and encoder options
scaler = StandardScaler()   # there are 3 options that I want to try
encoder = OneHotEncoder()   # only one option, no need to GridSearch it

# use ColumnTransformer to apply different preprocesses to numerical and categorical columns
preprocessor = ColumnTransformer(transformers = [('categorical', encoder, cat_columns),
                                                 ('numerical', scaler, num_columns),
                                                ])

# combine the preprocessor with LogisticRegression() using Pipeline 
full_pipeline = Pipeline(steps = [('preprocessor', preprocessor),
                                  ('log_reg', LogisticRegression())])

What I want is to try different scaling methods (e.g. standard scaling, robust scaling, etc.) and then pick the one that yields the best metric (e.g. accuracy). However, I don't know how to do this using GridSearchCV:

from sklearn.model_selection import GridSearchCV

# set params combination I want to try
scaler_options = {'numerical':[StandardScaler(), RobustScaler(), MinMaxScaler()]}

# initialize GridSearchCV using full_pipeline as final estimator
grid_cv = GridSearchCV(full_pipeline, param_grid = scaler_options, cv = 5)

# fit the data 
grid_cv.fit(X_train, y_train)

I know that the code above won't work, specifically because of the scaler_options that I've set as param_grid: GridSearchCV can't process it. Why? Because 'numerical' isn't a hyperparameter of the pipeline (unlike 'log_reg__C', a hyperparameter of LogisticRegression() that GridSearchCV can access). Instead, it's a component of the ColumnTransformer that I have nested inside full_pipeline.

So the main question is: how do I get GridSearchCV to test all of my scaler options, given that the scaler is a component of a nested step (i.e. the ColumnTransformer above)?

2
Update: I think I’ve found the solution, which is to create a custom transformer with a class that has the “scaling_options” as its initialization parameter to choose which scaling method I want to apply. That way I can insert the following dictionary {preprocessor__customtransformer__scaling_options: [list of options]} as the param_grid. Correct me if I’m wrong. - imavv
Please edit your question to add some explanation or code instead of using comments as you did. - help-info.de

2 Answers

2
votes

As you suggested, you could create a class that takes the scaler you want to use as an __init__() parameter.
Then, in your grid search parameters, you can specify which scaler the class should be initialized with.

I wrote this; I hope it helps:

class ScalerSelector(BaseEstimator, TransformerMixin):

    def __init__(self, scaler=StandardScaler()):
        super().__init__()
        self.scaler = scaler

    def fit(self, X, y=None):
        # fit the wrapped scaler, then return self as the scikit-learn API expects
        self.scaler.fit(X)
        return self

    def transform(self, X, y=None):
        return self.scaler.transform(X)
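
For what it's worth, here is a slightly stricter variant of the same wrapper, as a sketch only. It assumes you want to follow the usual scikit-learn conventions: no estimator instance as a default argument, and the fitted copy stored in a separate attribute (the name scaler_ is my own choice) so the object passed in is never modified. The small array at the end is made-up data just to exercise it.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.preprocessing import MinMaxScaler, StandardScaler


class ScalerSelector(BaseEstimator, TransformerMixin):
    """Thin wrapper whose only hyperparameter is the scaler to apply."""

    def __init__(self, scaler=None):
        # Store the parameter untouched; BaseEstimator.get_params() reads it back.
        self.scaler = scaler

    def fit(self, X, y=None):
        # Fit a fresh copy so the object passed to __init__ stays unfitted.
        # Falling back to StandardScaler is an assumption of this sketch.
        base = self.scaler if self.scaler is not None else StandardScaler()
        self.scaler_ = clone(base)
        self.scaler_.fit(X)
        return self

    def transform(self, X):
        return self.scaler_.transform(X)


# Made-up data, only to show that swapping the scaler via set_params works.
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

selector = ScalerSelector(scaler=MinMaxScaler())
scaled_minmax = selector.fit_transform(X_demo)

selector.set_params(scaler=StandardScaler())
scaled_standard = selector.fit_transform(X_demo)
```

Because scaler shows up in get_params(), GridSearchCV can reach it through a key of the form <step>__<transformer name>__scaler.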

Here is a full example that you can run to test it:

# import dependencies
from sklearn.linear_model import LogisticRegression

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.datasets import load_breast_cancer

from sklearn.base import BaseEstimator, TransformerMixin

import pandas as pd

class ScalerSelector(BaseEstimator, TransformerMixin):
    
    def __init__(self, scaler=StandardScaler()):
        super().__init__()
        self.scaler = scaler

    def fit(self, X, y=None):
        # fit the wrapped scaler, then return self as the scikit-learn API expects
        self.scaler.fit(X)
        return self

    def transform(self, X, y=None):
        return self.scaler.transform(X)


data = load_breast_cancer()
features = data["data"]
target = data["target"]
data = pd.DataFrame(data['data'], columns=data['feature_names'])
col_names = data.columns.tolist()

# scaler and encoder options
my_scaler = ScalerSelector()

preprocessor = ColumnTransformer(transformers = [('numerical', my_scaler, col_names)
                                                ])

# combine the preprocessor with LogisticRegression() using Pipeline 
full_pipeline = Pipeline(steps = [('preprocessor', preprocessor),
                                  ('log_reg', LogisticRegression())
                                  ])

# set params combination I want to try
scaler_options = {'preprocessor__numerical__scaler':[StandardScaler(), RobustScaler(), MinMaxScaler()]}

# initialize GridSearchCV using full_pipeline as final estimator
grid_cv = GridSearchCV(full_pipeline, param_grid = scaler_options)

# fit the data 
grid_cv.fit(data, target)

# best params:
print(grid_cv.best_params_)
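
If you are unsure whether a key in param_grid is spelled correctly, one way to check is to list the names returned by get_params() on the pipeline, since those are the names GridSearchCV passes to set_params(). A minimal sketch, using the step names from the question; the column names are hypothetical and only needed to build the ColumnTransformer:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical column names; no data is needed just to inspect parameters.
num_columns = ["age", "income"]

preprocessor = ColumnTransformer(
    transformers=[("numerical", StandardScaler(), num_columns)]
)
full_pipeline = Pipeline(
    steps=[("preprocessor", preprocessor), ("log_reg", LogisticRegression())]
)

# Every key returned by get_params() is a name that param_grid may use.
valid_keys = sorted(full_pipeline.get_params().keys())

# Keep only the keys that belong to the 'numerical' transformer.
scaler_keys = [k for k in valid_keys if k.startswith("preprocessor__numerical")]
print(scaler_keys)
```

Nested names are joined with a double underscore, which is why a bare 'numerical' key is rejected while the fully qualified name is accepted.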
0
votes

You can do this without creating a custom transformer, because a named step can itself be set through param_grid. You can even pass 'passthrough' in param_grid to test the scenario where no scaling is applied in that step at all.

In this example, suppose we want to investigate whether the model benefits from applying a scaler to the numerical features, num_features.

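from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer, StandardScaler

# `train` is assumed to be a pandas DataFrame that contains a 'target' column
cat_features = selector(dtype_exclude='number')(train.drop('target', axis=1))
num_features = selector(dtype_include='number')(train.drop('target', axis=1))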

cat_preprocessor = Pipeline(steps=[
    ('oh', OneHotEncoder(handle_unknown='ignore')),
    ('ss', StandardScaler(with_mean=False))  # one-hot output is sparse, so it cannot be centered
])
num_preprocessor = Pipeline(steps=[
    ('pt', PowerTransformer(method='yeo-johnson')),
    ('ss', StandardScaler())  # placeholder step that param_grid can replace
])
preprocessor = ColumnTransformer(transformers=[ 
    ('cat', cat_preprocessor, cat_features),
    ('num', num_preprocessor, num_features)                                                       
])
model = Pipeline(steps=[
    ('prep', preprocessor),
    ('clf', RidgeClassifier())
])
X = train.drop('target', axis=1)
y = train['target']
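param_grid = {
    # key format: <pipeline step>__<ColumnTransformer name>__<sub-pipeline step>
    # this key targets the 'ss' step of the 'cat' pipeline;
    # use 'prep__num__ss' instead to test the scaler on the numerical features
    'prep__cat__ss': ['passthrough', StandardScaler(with_mean=False)]
}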
gs = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring='roc_auc',
    n_jobs=-1,
    cv=2
)
gs.fit(X, y)
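
Applied to the layout in the question, where the scaler sits directly inside the ColumnTransformer rather than in a sub-pipeline, the same idea can be sketched as below. This is a self-contained example under my own assumptions: the breast cancer dataset stands in for the real data, every column is treated as numerical, and max_iter is raised only to reduce convergence warnings.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Stand-in data: all 30 columns are numerical.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
num_columns = X.columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[("numerical", StandardScaler(), num_columns)]
)
full_pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("log_reg", LogisticRegression(max_iter=5000)),
    ]
)

# The transformer named 'numerical' is itself a settable parameter,
# so each candidate below replaces the whole scaler.
param_grid = {
    "preprocessor__numerical": [
        StandardScaler(),
        RobustScaler(),
        MinMaxScaler(),
        "passthrough",  # no scaling at all
    ]
}

grid_cv = GridSearchCV(full_pipeline, param_grid=param_grid, cv=5, scoring="accuracy")
grid_cv.fit(X, y)

# One row per candidate scaler, best first.
results = pd.DataFrame(grid_cv.cv_results_)[
    ["param_preprocessor__numerical", "mean_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
print(grid_cv.best_params_)
```

The cv_results_ table makes it easy to see how far apart the scalers actually are, rather than only reading off the winner from best_params_.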