
Assume that I have three datasets in an ML problem:

train dataset: used to estimate ML model parameters (training)

test dataset: used to evaluate the trained model and calculate its accuracy

prediction dataset: used only for prediction after model deployment

I don't have a separate validation dataset; instead, I use Grid Search with k-fold cross-validation to find the best model.

Also, I have two python scripts as follows:

train.py: used to train and test the ML model; it loads the train and test datasets, finds the best model via Grid Search, and saves the trained model.

predict.py: used after deployment; it loads the pre-trained model and the prediction dataset, predicts the model output, and calculates accuracy.

Before starting the training process in train.py, I use MinMaxScaler as follows:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

scaler.fit(x_train)   # fit only on train dataset
x_train_norm = scaler.transform(x_train)
x_test_norm = scaler.transform(x_test)

In predict.py, after loading the prediction dataset, I use the same pre-processing as below:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

scaler.fit(x_predict)   # fit (again), this time on the prediction dataset
x_predict_norm = scaler.transform(x_predict)

As you can see above, both fit and transform are done on the prediction dataset. In train.py, however, fit is done on the train dataset, and that same fitted MinMaxScaler is used to transform the test dataset.

My understanding is that the test dataset is a simulation of the real data the model will see after deployment. Therefore, the pre-processing of the test and prediction datasets should be the same.

So I think separate MinMaxScalers should be used in train.py for the train and test datasets, as follows:

from sklearn.preprocessing import MinMaxScaler
scaler_train = MinMaxScaler()
scaler_test = MinMaxScaler()

scaler_train.fit(x_train)   # fit only on train dataset
x_train_norm = scaler_train.transform(x_train)

scaler_test.fit(x_test)   # fit only on test dataset
x_test_norm = scaler_test.transform(x_test)

What is the difference?

The value of x_test_norm differs between the two approaches. With a separate scaler, x_test_norm is guaranteed to lie in the range [0, 1] (the MinMaxScaler default). However, if I transform the test dataset with a MinMaxScaler that was fit on the train dataset, x_test_norm can fall outside [0, 1].
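To make the difference concrete, here is a minimal sketch with made-up single-feature data, showing that a scaler fit on the train dataset can push test values outside the default [0, 1] range, while a separate scaler fit on the test dataset cannot:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data: the test set reaches outside the train range.
x_train = np.array([[10.0], [50.0], [100.0]])
x_test = np.array([[1.0], [110.0]])

scaler = MinMaxScaler()                  # default feature_range=(0, 1)
scaler.fit(x_train)                      # learns min=10, Max=100 from train only
x_test_norm = scaler.transform(x_test)   # (value - 10) / 90
print(x_test_norm)                       # approx [[-0.1], [1.11]] -- outside [0, 1]

# A separate scaler fit on the test set forces [0, 1], but with a different min/Max:
scaler_test = MinMaxScaler().fit(x_test)
print(scaler_test.transform(x_test))     # [[0.], [1.]]
```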

Please let me know your thoughts on this.


1 Answer


When you call .transform(), MinMax scaling computes something like (value - min) / (Max - min). The values of min and Max are learned when you call .fit(). So the answer is: yes, you should fit the MinMaxScaler on the training dataset only, and then use that fitted scaler on the test dataset.
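As a sketch of that formula with made-up numbers: the fitted min and Max are exposed as the data_min_ and data_max_ attributes, and transform reproduces the manual computation exactly:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x_train = np.array([[10.0], [40.0], [100.0]])
scaler = MinMaxScaler().fit(x_train)    # learns data_min_=10, data_max_=100

# transform applies (value - min) / (Max - min) using the *training* min and Max
x = np.array([[55.0]])
manual = (x - scaler.data_min_) / (scaler.data_max_ - scaler.data_min_)
assert np.allclose(scaler.transform(x), manual)   # (55 - 10) / 90 = 0.5
```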

Just imagine a situation where some feature has Max=100 and min=10 in the training dataset, but Max=10 and min=1 in the test dataset. If you fit a separate MinMaxScaler on the test subset, yes, it will scale the feature into the range [0, 1], but relative to the training dataset those scaled values should be much lower. The two scalers would give the same raw value two different meanings.
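A quick sketch of that scenario with synthetic values shows the inconsistency, since the same raw value ends up with two very different scaled representations:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x_train = np.array([[10.0], [55.0], [100.0]])   # min=10, Max=100
x_test = np.array([[1.0], [5.5], [10.0]])       # min=1,  Max=10

train_scaler = MinMaxScaler().fit(x_train)
test_scaler = MinMaxScaler().fit(x_test)        # the separate scaler the question proposes

# The same raw value 10.0 gets two very different representations:
print(train_scaler.transform([[10.0]]))  # [[0.]] -- smallest value seen in training
print(test_scaler.transform([[10.0]]))   # [[1.]] -- largest value under the separate scaler
```

The model was trained on the first representation, so feeding it the second would silently shift every feature.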

Also, regarding Grid Search with k-fold cross-validation, you should use a Pipeline. In that case, Grid Search will automatically fit the MinMaxScaler on the k-1 training folds only. The scikit-learn example "Column Transformer with Mixed Types" is a good illustration of how to organize such a pipeline.
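A minimal sketch of such a pipeline, assuming synthetic data and a generic SVC classifier (swap in your own estimator and parameter grid):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scale", MinMaxScaler()), ("clf", SVC())])
param_grid = {"clf__C": [0.1, 1, 10]}

# During cross-validation, the scaler is re-fit on each set of k-1 training
# folds only, so the held-out fold never leaks into the scaling statistics.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(x_train, y_train)
print(search.best_params_, search.score(x_test, y_test))
```

Because the scaler lives inside the pipeline, the same fitted min/Max travel with the saved model, which also answers the predict.py question: load the pipeline and call predict, without fitting a new scaler.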