Assume that I have three datasets in an ML problem:
train dataset: used to estimate the ML model parameters (training)
test dataset: used to evaluate the trained model and calculate its accuracy
prediction dataset: used only for prediction after model deployment
I don't have an evaluation dataset, and I use grid search with k-fold cross-validation to find the best model.
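For illustration only, a grid search with k-fold cross-validation could look roughly like the sketch below. The toy data, the LogisticRegression estimator, and the parameter grid are all placeholders (assumptions), not the actual setup from train.py.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# Toy data standing in for the real train dataset (assumption).
rng = np.random.default_rng(0)
x_train = rng.normal(size=(60, 3))
y_train = (x_train[:, 0] > 0).astype(int)

# Hypothetical estimator and grid; swap in the real model and parameters.
param_grid = {"C": [0.1, 1.0, 10.0]}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Each candidate is scored by 5-fold cross-validation on the train dataset only.
search = GridSearchCV(LogisticRegression(), param_grid, cv=cv)
search.fit(x_train, y_train)

best_model = search.best_estimator_  # refit on the whole train dataset
best_params = search.best_params_
```

With the default refit=True, best_estimator_ is retrained on the full train dataset using the best parameters, so the test dataset is never seen during the search.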
Also, I have two Python scripts as follows:
train.py: loads the train and test datasets, trains and tests the ML model (the best model is found by grid search), and saves the trained model.
predict.py: loads the pre-trained model and the prediction dataset, predicts the model output, and calculates accuracy.
Before starting the training process in train.py, I use MinMaxScaler as follows:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(x_train) # fit only on train dataset
x_train_norm = scaler.transform(x_train)
x_test_norm = scaler.transform(x_test)
In predict.py, after loading the prediction dataset, I need to apply the same data pre-processing, as below:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(x_predict)
x_predict_norm = scaler.transform(x_predict)
As you can see above, both fit and transform are done on the prediction dataset. However, in train.py, fit is done on the train dataset, and the same MinMaxScaler is then used to transform the test dataset.
My understanding is that the test dataset is a simulation of the real data that the model is supposed to predict after deployment. Therefore, the data pre-processing of the test and prediction datasets should be the same.
I think separate MinMaxScaler instances should be used in train.py for the train and test datasets, as follows:
from sklearn.preprocessing import MinMaxScaler
scaler_train = MinMaxScaler()
scaler_test = MinMaxScaler()
scaler_train.fit(x_train) # fit only on train dataset
x_train_norm = scaler_train.transform(x_train)
scaler_test.fit(x_test) # fit only on test dataset
x_test_norm = scaler_test.transform(x_test)
What is the difference?
The values of x_test_norm will be different if I use separate MinMaxScaler instances as explained above. In this case, the values of x_test_norm are in the range [0, 1], which is the default feature_range of MinMaxScaler. However, if I transform the test dataset with a MinMaxScaler that was fit on the train dataset, the values of x_test_norm can fall outside the range [0, 1].
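The difference can be seen on a small made-up example. The arrays below are toy values chosen only for illustration (an assumption, not real data); MinMaxScaler is used with its default feature_range of (0, 1).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (made up for illustration).
x_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
x_test = np.array([[0.0], [3.0], [7.0]])

# Option A: fit on the train dataset, reuse the same scaler on the test dataset.
shared_scaler = MinMaxScaler().fit(x_train)
x_test_shared = shared_scaler.transform(x_test)

# Option B: fit a separate scaler on the test dataset itself.
separate_scaler = MinMaxScaler().fit(x_test)
x_test_separate = separate_scaler.transform(x_test)

print(x_test_shared.ravel())    # values may fall below 0 or above 1
print(x_test_separate.ravel())  # values always lie within [0, 1]
```

Note that the raw value 3.0 is mapped to different scaled values by the two scalers, so with option B the same input no longer means the same thing to a model trained on x_train_norm.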
Please let me know your thoughts on this.