0
votes

I am working on a model for a competition. We were provided with 2 datasets:

Dataset A: contains the label and is to be used to train/test the model. Dataset B: does not contain the label; it is used in a blind test, and a score is assigned based on the predictions.

My model is ready. However, when calling predict() on Dataset B (for the blind test), a question came up: do I have to apply the same pre-processing steps (removing duplicates, removing NAs, scaling numeric features) that I applied to Dataset A? And what about the NAs? Dataset B contains several of them.

Thanks in advance for your help.

2
Yes, I think you should apply the same pre-processing steps. As for the NA values, if a given column has only a handful of NAs, one quick fix would be to replace them with the column's mean or median. - Tim Biegeleisen

2 Answers

0
votes

I think you would have to apply the same pre-processing that was applied to Dataset A (removing duplicates, removing NAs, scaling numeric features), because otherwise the predictions could be affected. Give me points, friend.
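
To make "the same pre-processing" concrete, here is a minimal sketch in base R. It assumes that imputation medians and scaling parameters are learned from Dataset A only and then reused unchanged on Dataset B. The data frames `train` and `blind`, the column names, the helper `impute()`, and the `lm()` model are hypothetical placeholders, not the asker's actual data or model.

```r
# Toy stand-ins for Dataset A (labelled) and Dataset B (blind); all names are hypothetical.
train <- data.frame(x1 = c(1, 2, 3, 4, NA),
                    x2 = c(10, 20, 30, 40, 50),
                    y  = c(1.1, 2.0, 2.9, 4.2, 5.1))
blind <- data.frame(x1 = c(2, NA, 5),
                    x2 = c(15, 25, NA))

num_cols <- c("x1", "x2")

# Replace NAs in the given columns with the supplied per-column values.
impute <- function(df, cols, values) {
  for (col in cols) {
    df[[col]][is.na(df[[col]])] <- values[[col]]
  }
  df
}

# Learn the imputation values from the training data only.
medians <- sapply(train[num_cols], median, na.rm = TRUE)
train_imp <- impute(train, num_cols, medians)

# Learn the scaling parameters from the (imputed) training data only.
centers <- colMeans(train_imp[num_cols])
scales  <- sapply(train_imp[num_cols], sd)
train_imp[num_cols] <- scale(train_imp[num_cols], center = centers, scale = scales)

# Apply the SAME medians, centers and scales to the blind data.
blind_imp <- impute(blind, num_cols, medians)
blind_imp[num_cols] <- scale(blind_imp[num_cols], center = centers, scale = scales)

# Placeholder model; substitute whatever model was trained for the competition.
fit   <- lm(y ~ x1 + x2, data = train_imp)
preds <- predict(fit, newdata = blind_imp)
```

Reusing the training medians, means and standard deviations keeps the features of Dataset B on the scale the model was fitted on. Recomputing them on Dataset B would shift the inputs and could change the predictions.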

0
votes

When you use the predict function you will need to clean your data. You can use the complete.cases() function if you want to get rid of all rows that contain NAs. You shouldn't remove duplicates unless you have a record number or a unique key.

# Keep only the rows of Dataset B that have no missing values
datasetb.2 <- datasetb[complete.cases(datasetb), ]
predicted <- predict(datasetA.model, newdata = datasetb.2)
# Only possible where the true labels (actual) are known, e.g. a labelled hold-out set
accuracy <- sum(actual == predicted) / nrow(datasetb.2)
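
One caveat: dropping incomplete rows changes the number of rows, and a blind test may expect one prediction per row of Dataset B (that requirement is an assumption here, so check the competition rules). A minimal sketch that keeps the predictions aligned with the original rows, using hypothetical toy data and a placeholder `lm()` model:

```r
# Hypothetical toy data standing in for Dataset A and Dataset B.
dataseta <- data.frame(x = c(1, 2, 3, 4, 5),
                       y = c(2.1, 3.9, 6.2, 8.1, 9.8))
datasetb <- data.frame(x = c(1.5, NA, 4.5))

# Placeholder model; substitute the model trained for the competition.
datasetA.model <- lm(y ~ x, data = dataseta)

# Flag the rows of Dataset B that have no missing values.
complete <- complete.cases(datasetb)

# Pre-fill with NA so the result has one entry per row of Dataset B,
# then predict only for the complete rows.
predicted <- rep(NA_real_, nrow(datasetb))
predicted[complete] <- predict(datasetA.model,
                               newdata = datasetb[complete, , drop = FALSE])
```

Rows that remain NA in `predicted` still need a value before submission, for example by imputing the missing features first as suggested in the comment above.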