0
votes

I have a dataset, split into training and testing sets, with missing data that I would like to impute before classification.

I tried the preProcess function from the caret package. I want to impute the training set using the predictor variable, and to impute the testing set using only what is known from the training set, without using the predictor of the testing set (that I should not know).

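library(caret)
# Fit the kNN imputation on the training set
p = preProcess(x = training, method = "knnImpute", k = 10)
# Apply it to the training set, then to the testing set
pred = predict(object = p, newdata = training)
pred1 = predict(object = p, newdata = testing)

When I run this code, the first predict call (on the training set) fails with this error: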

Error in FUN(newX[, i], ...) : 
  cannot impute when all predictors are missing in the new data point

I also tried removing the predictor variable from the training set, but the result is the same. I tried the same steps on the iris dataset, removing some values in each column and removing the predictor, and it works. Yet the two datasets have the same characteristics: both are data.frames and both contain only numeric values.

3
Providing a reproducible example and the results of sessionInfo will help get your question answered. - topepo

3 Answers

1
votes

From your words ("without using the predictor of the testing set (that I should not know)"), I conclude that by "predictor" you mean the target variable - which is by itself a mistake. "Predictors" are the known features, from which we wish to predict the target variable...

If I am correct, you are actually trying to predict the target variable using missing-value imputation, which is again a mistake and not the purpose of imputation. The correct use is when some (but not all) values are missing from your predictors (features) and you want to impute them so that they can be used, say, as input to an ML algorithm that does not tolerate missing values.
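To make that concrete, here is a minimal sketch of imputing predictors only, fitting on the training rows and reusing the fitted object on the test rows. The iris data, the injected NAs, the 100-row split and k = 5 are all assumptions made for illustration, not the asker's setup. Note that knnImpute also centers and scales the data, and it may need the RANN package to be installed.

```r
library(caret)  # knnImpute may also need the RANN package installed

set.seed(1)
dat <- iris
# Knock out a few predictor values (hypothetical missingness pattern).
dat[sample(nrow(dat), 10), "Sepal.Length"] <- NA
dat[sample(nrow(dat), 10), "Petal.Width"]  <- NA

# Split by row position so the split does not depend on any column.
train_idx  <- sample(nrow(dat), 100)
predictors <- setdiff(names(dat), "Species")  # the target is left out
train_x <- dat[train_idx, predictors]
test_x  <- dat[-train_idx, predictors]

# Learn centering, scaling and the neighbour pool from training rows only.
pp <- preProcess(train_x, method = "knnImpute", k = 5)

# Apply the same fitted object to both sets.
train_imp <- predict(pp, newdata = train_x)
test_imp  <- predict(pp, newdata = test_x)

anyNA(train_imp)
anyNA(test_imp)
```

The target column never enters preProcess here, so nothing about the test labels is used when filling in the test predictors.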

0
votes

I faced the same error and worked out that it appears when the data set you are imputing (i.e. training) was created with createDataPartition by splitting the data into training and testing sets. Imputing works fine if you apply it to the original set before the split.
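If you want to check whether the split itself left you with rows that knnImpute cannot handle, a rough base-R diagnostic could look like the sketch below. The small training data frame and the bad_idx vector are made up for illustration. The NA-index route is only one possible way such rows can arise; the thread does not confirm that this is what happened here.

```r
# Hypothetical training set: row 2 is entirely missing.
training <- data.frame(a = c(1.2, NA, 3.1, 4.0),
                       b = c(0.5, NA, NA, 2.2))

# TRUE for rows in which every column is NA.
all_missing <- rowSums(is.na(training)) == ncol(training)
which(all_missing)

# Indexing a data.frame with an NA row index yields a fully missing row.
bad_idx    <- c(1, NA, 3)
split_rows <- training[bad_idx, ]
split_all_missing <- rowSums(is.na(split_rows)) == ncol(split_rows)

# Dropping fully missing rows leaves only rows with something to impute from.
training_ok <- training[!all_missing, , drop = FALSE]
```

If which(all_missing) returns anything on your real training set, those rows are the ones triggering "cannot impute when all predictors are missing in the new data point".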

0
votes

I had the same problem. I traced it to using the target column to build the trainRowNum variable with createDataPartition. When I did that, it raised the error

Error in quantile.default(y, probs = seq(0, 1, length = groups)): missing values and NaN's not allowed if 'na.rm' is FALSE

and proceeding further with knnImpute and predict gave the following error

Error in FUN(newX[, i], ...) : cannot impute when all predictors are missing in the new data point

So instead of using the target column, I created an Index variable

x$Index <- as.numeric(rownames(x))

and used the Index column, rather than the target, to partition the data into the train dataset. It worked well, with no error. The Index column can later be removed from the train dataset for further computation. I think that using a column with missing values for the data partition leads to this kind of problem.
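Put together, the workaround might look roughly like this. The data frame x with columns a, b and target is invented for the example, p = 0.7 is an arbitrary choice, and seq_len(nrow(x)) is used in place of as.numeric(rownames(x)) on the assumption that the row names are just 1..n.

```r
library(caret)

set.seed(42)
# Hypothetical data: two numeric predictors and a target with missing values.
x <- data.frame(a = rnorm(50), b = rnorm(50), target = rnorm(50))
x$target[c(3, 17, 28)] <- NA
x$a[c(5, 9)] <- NA

# Partition on a complete column (the row index) instead of the target.
x$Index <- seq_len(nrow(x))
trainRowNum <- createDataPartition(x$Index, p = 0.7, list = FALSE)

# Drop the helper column again before any further computation.
keep  <- setdiff(names(x), "Index")
train <- x[trainRowNum, keep]
test  <- x[-trainRowNum, keep]

# Sanity check: no row of the training set should be entirely missing.
sum(rowSums(is.na(train)) == ncol(train))
```

Because Index has no missing values, createDataPartition has nothing to trip over, and the resulting train set can go to preProcess with method = "knnImpute" as in the question.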