16
votes

I am using Sklearn to build a linear regression model (or any other model) with the following steps:

X_train and Y_train are the training data

  1. Standardize the training data

      X_train = preprocessing.scale(X_train)
    
  2. Fit the model

     model.fit(X_train, Y_train)
    

Once the model is fitted on scaled data, how can I predict on new data (one or more data points at a time) using the fitted model?

What I currently do is:

  1. Scale the data

    NewData_Scaled = preprocessing.scale(NewData)
    
  2. Predict the data

    PredictedTarget = model.predict(NewData_Scaled)
    

I think I am missing a transformation object for preprocessing.scale that I can save with the trained model and then apply to new, unseen data. Any help, please?


2 Answers

29
votes

Take a look at these docs.

You can use the StandardScaler class of the preprocessing module to remember the scaling of your training data so you can apply it to future values.

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
# Fit once on the training data; the scaler stores the per-feature statistics
scaler = StandardScaler().fit(X_train)

scaler has calculated the mean and scaling factor to standardize each feature.

>>> scaler.mean_
array([ 1. ...,  0. ...,  0.33...])
>>> scaler.scale_
array([ 0.81...,  0.81...,  1.24...])
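As a quick sanity check on what those two attributes hold, the sketch below recomputes them by hand with NumPy. It assumes the same toy `X_train` as above; the names `manual_mean` and `manual_std` are only illustrative. Note that the comparison uses the population standard deviation (`ddof=0`), which is NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
scaler = StandardScaler().fit(X_train)

# Per-column statistics computed by hand (population std, ddof=0)
manual_mean = X_train.mean(axis=0)
manual_std = X_train.std(axis=0)

# Compare with what the fitted scaler stored
means_match = np.allclose(scaler.mean_, manual_mean)
scales_match = np.allclose(scaler.scale_, manual_std)
print(means_match, scales_match)
```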

To apply it to a dataset:

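import numpy as np

X_train_scaled = scaler.transform(X_train)
# transform expects a 2D array of shape (n_samples, n_features),
# so a single data point is wrapped in an extra pair of brackets
new_data = np.array([[-1.,  1., 0.]])
new_data_scaled = scaler.transform(new_data)
>>> new_data_scaled
array([[-2.44...,  1.22..., -0.26...]])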
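One way to keep the scaler and the model together, so that new data is always scaled with the training statistics, is scikit-learn's `make_pipeline`. This is not part of the answer above, just a minimal sketch; the toy data, the coefficients, and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: the target is an exact linear function of the features
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(50, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + 3.0

# fit() fits the scaler on X_train, then fits the regression on the scaled data
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# New data is passed unscaled; predict() reuses the training mean and scale
new_data = np.array([[4.0, 6.0, 5.0]])
predicted = model.predict(new_data)
expected = new_data @ np.array([1.0, -2.0, 0.5]) + 3.0
print(predicted, expected)
```

Calling `preprocessing.scale(NewData)` on the new data instead would standardize it with its own mean and standard deviation, so a single data point would be mapped to all zeros regardless of its values.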
1
votes

The answer above is fine when you process the training data and the test data in a single run.
But what if you want to test or run inference later, after training has finished?

In that case you can save the scaler's statistics and reapply them:

from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data 

sc = StandardScaler()
sc.fit(X)
x = sc.transform(X)
# New data: only one sample, but it must still have all four features
sc.transform(np.array([[6.5, 1.5, 2.5, 6.5]]))  # reference output to compare against later



# Save the per-feature standard deviation and mean for later use
std = np.sqrt(sc.var_)
np.save('std.npy', std)
np.save('mean.npy', sc.mean_)

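The following block is independent of the one above and can run later, for example in a separate script:

import numpy as np

s = np.load('std.npy')
m = np.load('mean.npy')
(np.array([[6.5, 1.5, 2.5, 6.5]]) - m) / s   # z = (x - u) / s ---> main formula
# gives the same output as sc.transform above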
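An alternative to saving the mean and standard deviation by hand is to serialize the fitted scaler object itself. The sketch below uses Python's standard `pickle` module; the file name `scaler.pkl`, the temporary directory, and the toy data are assumptions made for the example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
scaler = StandardScaler().fit(X_train)

# Save the fitted scaler; "scaler.pkl" is a hypothetical file name
path = os.path.join(tempfile.mkdtemp(), "scaler.pkl")
with open(path, "wb") as f:
    pickle.dump(scaler, f)

# Later, possibly in another process: load it and transform new data
with open(path, "rb") as f:
    loaded_scaler = pickle.load(f)

new_data = np.array([[-1., 1., 0.]])
same_result = np.allclose(loaded_scaler.transform(new_data),
                          scaler.transform(new_data))
print(same_result)
```

A pickled scaler should be loaded with the same scikit-learn version that created it, and only from a source you trust, since unpickling can execute arbitrary code.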