I am working on a model for a competition. We were provided with two datasets:
Dataset A: contains the label and is to be used to train/test the model.
Dataset B: does not contain the label. It is used in a blind test, and a score is assigned based on the predictions.
The model is ready. However, when calling the function predict() on Dataset B (for the blind test), one question came up: do I have to apply the same pre-processing steps (removing duplicates and NAs, scaling numeric features) that I applied to Dataset A? And what about the NAs? Dataset B contains several of them.
Thanks in advance for your help.
NA values, if a given column has only a handful of NA, one quick fix would be to just replace them with the column's mean or median. – Tim Biegeleisen
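A minimal sketch of that quick fix, under several assumptions: it is written in plain Python (the question does not say which language or library is in use), `None` stands in for NA, and the column names, toy values, and the helpers `fit_preprocessor` and `transform` are hypothetical. The idea it illustrates is that the fill value and the scaling parameters are learned from Dataset A only and then reused on Dataset B, and that rows of Dataset B are imputed rather than dropped, since a blind test usually expects one prediction per row.

```python
from statistics import mean, median, pstdev

# Hypothetical toy data; None stands in for NA.
train_rows = [  # plays the role of Dataset A (features only)
    {"age": 20.0, "income": 30.0},
    {"age": 30.0, "income": None},
    {"age": 40.0, "income": 50.0},
    {"age": 30.0, "income": 40.0},
]
blind_rows = [  # plays the role of Dataset B
    {"age": None, "income": 60.0},
    {"age": 50.0, "income": None},
]


def fit_preprocessor(rows, columns):
    """Learn the fill value and scaling parameters from the training rows only."""
    params = {}
    for col in columns:
        observed = [r[col] for r in rows if r[col] is not None]
        fill = median(observed)  # median of the non-missing training values
        filled = [fill if r[col] is None else r[col] for r in rows]
        # Fall back to 1.0 so a constant column does not cause division by zero.
        params[col] = {"fill": fill, "mean": mean(filled), "std": pstdev(filled) or 1.0}
    return params


def transform(rows, params):
    """Impute and standardize rows with already-fitted parameters; no row is dropped."""
    out = []
    for r in rows:
        new_row = {}
        for col, p in params.items():
            value = p["fill"] if r[col] is None else r[col]
            new_row[col] = (value - p["mean"]) / p["std"]
        out.append(new_row)
    return out


params = fit_preprocessor(train_rows, ["age", "income"])
train_ready = transform(train_rows, params)  # what the model would be trained on
blind_ready = transform(blind_rows, params)  # same parameters, reused for the blind test
```

The resulting `blind_ready` has the same number of rows as `blind_rows`, so every row of the blind set can still be passed to the model's prediction step.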