
I have been working on a machine learning model and I'm currently using a Pipeline with GridSearchCV. My data is scaled with MinMaxScaler and I'm using an SVR with an RBF kernel. My question is: now that my model is complete, fitted, and has a decent evaluation score, do I need to scale new data with MinMaxScaler before making predictions, or can I predict on the data as-is? I've read three books on scikit-learn, but they all focus on feature engineering and fitting; they don't really cover any steps at prediction time beyond calling the predict method.

This is the code:

from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# Scaling and regression in one pipeline, tuned with a time-series-aware split
pipe = Pipeline([('scaler', MinMaxScaler()), ('clf', SVR())])
time_split = TimeSeriesSplit(n_splits=5)

param_grid = {'clf__kernel': ['rbf'],
              'clf__C': [0.0001, 0.001],
              'clf__gamma': [0.0001, 0.001]}

grid = GridSearchCV(pipe, param_grid, cv=time_split,
                    scoring='neg_mean_squared_error', n_jobs=-1)
grid.fit(X_train, y_train)
Comment: You can save the complete GridSearchCV, or the best estimator found by it, to a file using pickle or joblib, and then load that at prediction time. – Vivek Kumar
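The comment's joblib suggestion can be sketched like this. This is a minimal example with toy data and a made-up file name, using a plain fitted pipeline as a stand-in for the fitted GridSearchCV; the point is that the fitted scaler is saved and reloaded together with the model:

```python
import numpy as np
from joblib import dump, load
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# Toy training data standing in for the question's X_train / y_train
X_train = np.arange(20, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

# The scaler is a pipeline step, so it is fitted and persisted with the model
pipe = Pipeline([('scaler', MinMaxScaler()), ('clf', SVR(kernel='rbf'))])
pipe.fit(X_train, y_train)

dump(pipe, 'model.joblib')    # persist the whole fitted pipeline
model = load('model.joblib')  # reload it later, e.g. at prediction time

pred = model.predict(np.array([[10.0]]))  # raw input; scaling happens inside
```

Because the scaler travels inside the pipeline, the reloaded model accepts raw, unscaled input.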

2 Answers

3 votes

Sure, if you get new (in the sense of unprocessed) data, you need to apply the same preparation steps you applied when training the model. For example, if you use MinMaxScaler with its default feature_range=(0, 1), the model expects each feature scaled into [0, 1] using that feature's minimum and maximum from the training set; if you don't preprocess the new data the same way, the model can't produce accurate results.

Keep in mind to use exactly the same MinMaxScaler object that was fitted on the training data, so new data is transformed with the training minima and maxima. So in case you save your model to a file, save your preprocessing objects as well (pickling the whole Pipeline does both at once).
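To see why reusing the same fitted scaler matters, here is a minimal sketch with toy numbers (not from the question): transform reuses the min/max learned during fit rather than recomputing them on the new data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                # default feature_range=(0, 1)
scaler.fit(np.array([[0.0], [10.0]]))  # learns min=0, max=10 from training data

# transform() reuses the training min/max; it never refits on new data
x_in = scaler.transform(np.array([[5.0]]))    # -> 0.5
x_out = scaler.transform(np.array([[20.0]]))  # -> 2.0, outside [0, 1]
```

A freshly fitted scaler on the new data would map it to a different range than the model was trained on, which is exactly the mistake to avoid.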

0 votes

I wanted to follow up my question with a solution, thanks to pythonic833's answer. One correction to my original attempt: because the MinMaxScaler is a step inside the Pipeline, the fitted GridSearchCV already applies the scaler (fitted on the original training data during grid.fit) every time predict is called, so new data should be passed in unscaled. Scaling it manually first would scale it twice. A separate, manually fitted MinMaxScaler is only needed if the model was trained outside a pipeline. Below is my code based on pythonic833's answer and some of the other comments, such as saving the model with pickle.

from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
import pickle

pipe = Pipeline([('scaler', MinMaxScaler()), ('clf', SVR())])
time_split = TimeSeriesSplit(n_splits=5)
param_grid = {'clf__kernel': ['rbf'],
              'clf__C': [0.0001, 0.001],
              'clf__gamma': [0.0001, 0.001]}

grid = GridSearchCV(pipe, param_grid, cv=time_split,
                    scoring='neg_mean_squared_error', n_jobs=-1)
grid.fit(X_train, y_train)

# Pickle the fitted search (pipeline, scaler and all) with a context manager
with open('Pickles/{}.pkl'.format(file_name), 'wb') as file:
    pickle.dump(grid, file)

# Load the pickle with a context manager
with open('Pickles/{}.pkl'.format(file_name), 'rb') as file:
    model = pickle.load(file)

# The loaded model is the whole fitted pipeline: its MinMaxScaler was fitted
# on X_train during grid.fit(), and predict() applies it automatically.
# Pre-scaling new_data by hand here would scale it twice.
model.predict(new_data)

# Only if the SVR had been trained *outside* a pipeline would you scale
# manually, reusing the scaler fitted on the original training data:
# scaler = MinMaxScaler()
# scaler.fit(X_train)
# new_data_scaled = scaler.transform(new_data)