How to predict on the Test data using random forest when "prediction" column is missing in the data set given?

Question

How do I predict on the Test data using randomForest when the column to be predicted (is_promoted) is missing from the Test data set?

I have been given two data sets, Train and Test. For the Test data set, I have to predict whether each employee will be promoted or not.

The Train data set has the is_promoted column, which was used to build the model. I used Test$is_promoted=NA to add an is_promoted column to the Test data set so that both sets have the same dimensions during data preparation.

But when I use random forest to predict the final values, those NA values cause a missing-value error.

set.seed(123)
rf_m3=randomForest(is_promoted~.,data = FinalTest,ntree=150, nodesize=50, mtry=5)
rf_test_pred=predict(rf_m3, FinalTest, type="class")

Error message:

Error in na.fail.default(list(is_promoted = c(NA_integer_, NA_integer_,  : 
  missing values in object

I also can't remove "is_promoted", since it's my dependent variable.

So kindly suggest a way to handle this issue and the modification of the code required.

PS: New learner. First time trying random forest, so if possible please explain as much as possible.

Random forests is a supervised machine learning method, which builds a model around a series of predictors and a known response. If you truly have no information on whether or not the subjects received a promotion, then you can't really use random forests. — Tim Biegeleisen
The point of a test set is exactly that you don't have the variable you want to predict. So it's perfectly normal that is_promoted is missing. You shouldn't create such a column when using predict. — nicola

Bappa Das · Accepted Answer · 2019-10-16T13:54:21

I think your dependent variable contains NAs, which is why the error occurs. You can check this with summary(FinalTest). If is_promoted contains NA values (which I think it does), use

rf_m3 = randomForest(is_promoted~., data = FinalTest, ntree=150, nodesize=50, mtry=5, na.action=na.omit)
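One caveat: if every is_promoted value in FinalTest is NA, as it would be after Test$is_promoted=NA, then na.omit drops every row and there is nothing left to fit on. As the comments under the question point out, the usual workflow is to fit on the Train data, where is_promoted is known, and then call predict on the Test data with no is_promoted column at all. Here is a minimal sketch of that workflow. The toy data, the predictor names age and avg_training_score, and the nodesize and mtry values are assumptions for illustration only, not taken from the real data set. It also converts is_promoted to a factor so that randomForest does classification rather than regression.

```r
library(randomForest)

set.seed(123)

# Hypothetical toy data standing in for the real Train set
# (column names other than is_promoted are made up for illustration)
n <- 200
Train <- data.frame(
  age = sample(20:60, n, replace = TRUE),
  avg_training_score = sample(40:99, n, replace = TRUE)
)
# The response must be a factor for classification
Train$is_promoted <- factor(as.integer(Train$avg_training_score > 80))

# Hypothetical Test set: same predictors, NO is_promoted column
Test <- data.frame(
  age = sample(20:60, 50, replace = TRUE),
  avg_training_score = sample(40:99, 50, replace = TRUE)
)

# Fit on Train, where the response is known
rf_fit <- randomForest(is_promoted ~ ., data = Train,
                       ntree = 150, nodesize = 5, mtry = 2)

# Predict on Test; predict() only needs the predictor columns
rf_test_pred <- predict(rf_fit, newdata = Test, type = "response")

# Attach the predictions afterwards if a combined table is needed
Test$is_promoted <- rf_test_pred
```

If Train and Test must go through the same preparation steps, one option is to apply those steps to the predictor columns only, so no placeholder is_promoted column is needed in Test.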
