
I have been working on a machine learning model and I'm currently using a Pipeline with GridSearchCV. My data is scaled with MinMaxScaler and I'm using an SVR with an RBF kernel. My question is: now that my model is complete, fitted, and has a decent evaluation score, do I need to scale new data with MinMaxScaler before making predictions, or can I predict on the data as-is? I've read three books on scikit-learn, but they all focus on feature engineering and fitting; they don't really cover any steps at prediction time beyond calling the predict method.

This is the code:

from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# Scaling and regression in one pipeline, tuned with a time-series-aware split
pipe = Pipeline([('scaler', MinMaxScaler()), ('clf', SVR())])
time_split = TimeSeriesSplit(n_splits=5)

param_grid = {'clf__kernel': ['rbf'],
              'clf__C': [0.0001, 0.001],
              'clf__gamma': [0.0001, 0.001]}

grid = GridSearchCV(pipe, param_grid, cv=time_split,
                    scoring='neg_mean_squared_error', n_jobs=-1)
grid.fit(X_train, y_train)
Comment: You can save the complete GridSearchCV, or the best estimator found by it, to a file using pickle or joblib, and then load that at prediction time. – Vivek Kumar
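The comment's joblib suggestion can be sketched like this. This is a minimal example with toy data and a made-up file name, using a plain fitted pipeline as a stand-in for the fitted GridSearchCV; the point is that the fitted scaler is saved and reloaded together with the model:

```python
import numpy as np
from joblib import dump, load
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# Toy training data standing in for the question's X_train / y_train
X_train = np.arange(20, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

# The scaler is a pipeline step, so it is fitted and persisted with the model
pipe = Pipeline([('scaler', MinMaxScaler()), ('clf', SVR(kernel='rbf'))])
pipe.fit(X_train, y_train)

dump(pipe, 'model.joblib')    # persist the whole fitted pipeline
model = load('model.joblib')  # reload it later, e.g. at prediction time

pred = model.predict(np.array([[10.0]]))  # raw input; scaling happens inside
```

Because the scaler travels inside the pipeline, the reloaded model accepts raw, unscaled input.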

2 Answers

3 votes

Sure, if you get new (in the sense of unprocessed) data, you need to apply the same preparation steps you applied when training the model. For example, if you use MinMaxScaler with its default feature_range=(0, 1), the model expects each feature scaled into [0, 1] using that feature's minimum and maximum from the training set; if you don't preprocess the new data the same way, the model can't produce accurate results.

Keep in mind to use exactly the same MinMaxScaler object that was fitted on the training data, so new data is transformed with the training minima and maxima. So in case you save your model to a file, save your preprocessing objects as well (pickling the whole Pipeline does both at once).
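To see why reusing the same fitted scaler matters, here is a minimal sketch with toy numbers (not from the question): transform reuses the min/max learned during fit rather than recomputing them on the new data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                # default feature_range=(0, 1)
scaler.fit(np.array([[0.0], [10.0]]))  # learns min=0, max=10 from training data

# transform() reuses the training min/max; it never refits on new data
x_in = scaler.transform(np.array([[5.0]]))    # -> 0.5
x_out = scaler.transform(np.array([[20.0]]))  # -> 2.0, outside [0, 1]
```

A freshly fitted scaler on the new data would map it to a different range than the model was trained on, which is exactly the mistake to avoid.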

0 votes

I wanted to follow up my question with a solution, thanks to pythonic833's answer. One correction to my original attempt: because the MinMaxScaler is a step inside the Pipeline, the fitted GridSearchCV already applies the scaler (fitted on the original training data during grid.fit) every time predict is called, so new data should be passed in unscaled. Scaling it manually first would scale it twice. A separate, manually fitted MinMaxScaler is only needed if the model was trained outside a pipeline. Below is my code based on pythonic833's answer and some of the other comments, such as saving the model with pickle.

from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
import pickle

pipe = Pipeline([('scaler', MinMaxScaler()), ('clf', SVR())])
time_split = TimeSeriesSplit(n_splits=5)
param_grid = {'clf__kernel': ['rbf'],
              'clf__C': [0.0001, 0.001],
              'clf__gamma': [0.0001, 0.001]}

grid = GridSearchCV(pipe, param_grid, cv=time_split,
                    scoring='neg_mean_squared_error', n_jobs=-1)
grid.fit(X_train, y_train)

# Pickle the fitted search (pipeline, scaler and all) with a context manager
with open('Pickles/{}.pkl'.format(file_name), 'wb') as file:
    pickle.dump(grid, file)

# Load the pickle with a context manager
with open('Pickles/{}.pkl'.format(file_name), 'rb') as file:
    model = pickle.load(file)

# The loaded model is the whole fitted pipeline: its MinMaxScaler was fitted
# on X_train during grid.fit(), and predict() applies it automatically.
# Pre-scaling new_data by hand here would scale it twice.
model.predict(new_data)

# Only if the SVR had been trained *outside* a pipeline would you scale
# manually, reusing the scaler fitted on the original training data:
# scaler = MinMaxScaler()
# scaler.fit(X_train)
# new_data_scaled = scaler.transform(new_data)