I've been working on this classification problem using sklearn's Pipeline to combine the preprocessing step (scaling) and the cross validation step (GridSearchCV) using Logistic Regression.
Here is the simplified code:
# import dependencies
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler, RobustScaler
# scaler and encoder options
scaler = StandardScaler() # there are 3 options that I want to try
encoder = OneHotEncoder() # only one option, no need to GridSearch it
# use ColumnTransformer to apply different preprocesses to numerical and categorical columns
preprocessor = ColumnTransformer(transformers = [('categorical', encoder, cat_columns),
('numerical', scaler, num_columns),
])
# combine the preprocessor with LogisticRegression() using Pipeline
full_pipeline = Pipeline(steps = [('preprocessor', preprocessor),
('log_reg', LogisticRegression())])
What I'm trying to do, is to try different scaling methods (e.g. standard scaling, robust scaling, etc.) and after trying all of those, pick the scaling method that yields the best metric (i.e. accuracy). However, I don't know how to do this using the GridSearchCV:
from sklearn.model_selection import GridSearchCV
# set params combination I want to try
scaler_options = {'numerical':[StandardScaler(), RobustScaler(), MinMaxScaler()]}
# initialize GridSearchCV using full_pipeline as final estimator
grid_cv = GridSearchCV(full_pipeline, param_grid = scaler_options, cv = 5)
# fit the data
grid_cv.fit(X_train, y_train)
I know that the code above won't work, particularly because of the scaler_options that I've set as param_grid. I realize that the scaler_options I set can't be processed by GridSearchCV. Why? Because it isn't a hyperparameter of the pipeline (unlike 'log_reg__C', a hyperparameter from LogisticRegression() than can be accessed by the GridSearchCV). But instead its a component of the ColumnTransformer which I have nested inside the full_pipeline.
So the main question is, how do I automate GridSearchCV to test all of my scaler options? Since the scaler is a component of a sub-pipeline (i.e. the previous ColumnTransformer).