I noticed that predict() will only create predictions on complete cases. I had included medianImpute in the preProcess options, such as the following:

train(outcome ~ .,
      data = df,
      method = "rf",
      tuneLength = 5,
      preProcess = c("YeoJohnson", "center", "scale", "medianImpute"),
      metric = "ROC",
      trControl = train_ctrl)

Does this mean that I should impute the missing values before training the model? If I don't, I am unable to create a prediction for every case in the test set. I had read in Dr. Kuhn's book that pre-processing should occur during cross-validation... Thanks!

1 Answer

If you are using medianImpute, then the imputation definitely needs to happen before the training set is created. Otherwise, even if you impute the test set with medianImpute as well, the results would be wrong.

Take the following extreme case as an example:

You have only one independent variable X, which consists of the numbers 1 to 100. Imagine the extreme case of splitting the data set into a 50% test set and a 50% training set, with the numbers 1 to 50 in the test set and the numbers 51 to 100 in the training set.

> median(1:50)  #test set median
[1] 25.5
> median(51:100) #training set median
[1] 75.5

Using your code (caret's train function), the missing values in the training set would be replaced with 75.5. This creates three major problems:

  1. You cannot use the same method (medianImpute) on the test set, because missing values in the test set would be replaced with 25.5, a different value from the one used in training.
  2. You cannot manually replace the missing values in the test set with 75.5, because 75.5 is much higher than the maximum value in the test set (50) and would dramatically skew it.
  3. The train function of the caret package tries to find the best parameters for your model (tuning). Replacing the missing values with 75.5 when the full data set's median (the correct value for imputing missing data) is 50.5 would tune the model with the wrong parameter values.
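The three problems above can be seen with a few lines of base R. This is only a sketch of the extreme case described here: the positions of the missing values (10, 40, 60, 90) are made up for illustration, and the imputation is done by hand rather than through caret.

```r
# Hypothetical data: X holds 1..100, with a few values set to NA for illustration.
x <- as.numeric(1:100)
x[c(10, 40, 60, 90)] <- NA

test_x  <- x[1:50]    # numbers 1 to 50 (two of them missing)
train_x <- x[51:100]  # numbers 51 to 100 (two of them missing)

# Medians computed on each piece and on the full data set.
train_median <- median(train_x, na.rm = TRUE)
test_median  <- median(test_x,  na.rm = TRUE)
full_median  <- median(x,       na.rm = TRUE)

# Problem 2: filling the test set with the training median puts a value
# above the largest observed test value into the test set.
test_imputed <- ifelse(is.na(test_x), train_median, test_x)

print(c(train = train_median, test = test_median, full = full_median))
print(max(test_x, na.rm = TRUE))
print(max(test_imputed))
```

With this split, the training, test, and full-data medians all differ, and the imputed test set ends up with a maximum that never occurred in the observed test values.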

Therefore, the best thing to do is to deal with the missing data before creating the training set: impute using the full data set first, then split.
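A minimal sketch of that order of operations in base R follows. The helper `impute_median`, the toy data frame, and the NA positions are all hypothetical and only illustrate the idea; in your case `df` and `outcome` would be your own data.

```r
# Hypothetical helper: replace NAs in a numeric vector with that vector's median.
impute_median <- function(v) {
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
}

# Toy data standing in for the real data set (assumed for illustration).
df <- data.frame(X = as.numeric(1:100),
                 outcome = factor(rep(c("a", "b"), 50)))
df$X[c(10, 40, 60, 90)] <- NA

# Step 1: impute every numeric column using the full data set.
num_cols <- sapply(df, is.numeric)
df[num_cols] <- lapply(df[num_cols], impute_median)

# Step 2: only now split into training and test sets
# (same extreme split as in the example above).
train_idx <- 51:100
train_df  <- df[train_idx, ]
test_df   <- df[-train_idx, ]

# Every test row is now a complete case, so predict() has nothing to drop.
print(sum(complete.cases(test_df)))
```

Since no NAs remain after step 1, "medianImpute" could then be left out of the preProcess vector passed to train.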

Hope this helps!