R Caret Package error imputing data with Pre-Process function

Question

I have a dataset (training - testing) with missing data and I would like to impute data before the classification.

I tried the preProcess function from the caret package. I want to impute the training set using its predictor variable, and then impute the testing set using only the knowledge gained from the training set, without using the predictor of the testing set (that I should not know).

p = preProcess(x = training, method = "knnImpute", k = 10)
pred = predict(object = p, newdata = training)
pred1 = predict(object = p, newdata = testing)

When I run this code, I get this error on the second line:

Error in FUN(newX[, i], ...) : 
  cannot impute when all predictors are missing in the new data point

I also tried removing the predictor variable from the training set, but the result is the same. I tried the same steps on the Iris dataset, removing some values in each column and removing the predictor, and it works... but the two datasets have the same characteristics: both are data.frames containing only numeric values.

Providing a reproducible example and the results of sessionInfo will help get your question answered. — topepo

desertnaut · Accepted Answer · 2015-04-07T12:57:35

From your words ("without using the predictor of the testing set (that I should not know)"), I conclude that by "predictor" you mean the target variable - which is by itself a mistake. "Predictors" are the known features, from which we wish to predict the target variable...

If I am correct, you are actually trying to predict the target variable using missing value imputation, which is again a mistake, and not the purpose of missing value imputation. The correct use is when you have some (but not all) values missing from your predictors (features), and you want to impute them so that they can be used, say, as input to some ML algorithm which does not tolerate missing values.
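To make the intended workflow concrete, here is a minimal sketch, not a definitive fix for the asker's data. It assumes the target lives in a column called "Species" (iris is used as a stand-in, and the train/test split and the injected NAs are made up for illustration). The imputation model is fitted on the training features only and then applied to the test features, after dropping any test rows in which every feature is missing, since that is the situation the error message describes. Note that "knnImpute" also centers and scales the features, and that it may require the RANN package to be installed.

```r
library(caret)  # knnImpute may additionally need the RANN package installed

set.seed(1)

# Stand-in data (assumption): numeric features plus a target column "Species"
data(iris)
idx      <- sample(nrow(iris), 100)
training <- iris[idx, ]
testing  <- iris[-idx, ]

# Inject a few missing feature values to simulate incomplete data
training[sample(nrow(training), 10), "Sepal.Length"] <- NA
testing[sample(nrow(testing), 5), "Petal.Width"]     <- NA

# Features only: the target is never passed to the imputation step
features <- setdiff(names(training), "Species")

# A row with every feature missing has nothing to compute distances from,
# so it cannot be imputed; find and drop such rows before predicting
all_missing <- rowSums(!is.na(testing[, features])) == 0
testing     <- testing[!all_missing, ]

# Fit the imputation model on the training features only
pp <- preProcess(training[, features], method = "knnImpute", k = 10)

# Apply the same fitted model to both sets (values come back centered/scaled)
train_imputed <- predict(pp, newdata = training[, features])
test_imputed  <- predict(pp, newdata = testing[, features])
```

If the error persists on real data, inspecting `which(all_missing)` before the filtering step shows which test rows have no usable features at all.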
