0
votes

How do I predict on the Test data using randomForest when the response column (is_promoted) is missing from the given Test data set?

I have been given two data sets, Train and Test. For the Test data set I have to predict whether each employee will be promoted or not.

The Train data set has the is_promoted column, which is used to build the model. I used Test$is_promoted=NA to add an is_promoted column to my Test data set so that both sets have the same dimensions during data preparation.

But when I use random forest to predict the final values, those NA values are reported as a missing-value error.

set.seed(123)
rf_m3=randomForest(is_promoted~.,data = FinalTest,ntree=150, nodesize=50, mtry=5)
rf_test_pred=predict(rf_m3, FinalTest, type="class")

Error message:

Error in na.fail.default(list(is_promoted = c(NA_integer_, NA_integer_,  : 
  missing values in object

I can't remove is_promoted either, as it is my dependent variable.

Please suggest a way to handle this issue and the code changes required.

PS: New learner. First time trying random forest, so if possible please explain as much as possible.

1
Random forests is a supervised machine learning method, which builds a model around a series of predictors and a known response. If you truly have no information on whether or not the subjects received a promotion, then you can't really use random forests. - Tim Biegeleisen
The point of a test set is exactly that you don't have the variable you want to predict. So it's perfectly normal that is_promoted is missing. You shouldn't create such a column when using predict. - nicola

1 Answer

0
votes

I think your dependent variable contains NAs, which is why the error occurs. You can check this with summary(FinalTest). If is_promoted contains NA values (which I think it does), use

rf_m3 = randomForest(is_promoted~., data = FinalTest, ntree=150, nodesize=50, mtry=5, na.action=na.omit)
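Note that if every is_promoted value in FinalTest is NA, na.omit would drop every row and leave nothing to fit. Following the comments on the question, a minimal sketch of the more usual workflow is to fit on the training data (where is_promoted is known) and pass the test data only as newdata to predict, without an is_promoted column. The data below is made up for illustration: the predictor names age and score, the sample sizes, and the tuning values are assumptions, not the asker's real data.

```r
library(randomForest)

set.seed(123)

# Hypothetical stand-in for the training data; the real columns are not shown in the question.
n <- 200
Train <- data.frame(
  age   = sample(20:60, n, replace = TRUE),
  score = runif(n, 40, 100)
)
# Known response in the training set, stored as a factor so a classification forest is grown.
Train$is_promoted <- factor(ifelse(Train$score > 80, 1, 0))

# Hypothetical stand-in for the test data: same predictors, no response.
Test <- data.frame(
  age   = sample(20:60, 50, replace = TRUE),
  score = runif(50, 40, 100)
)

# If an all-NA placeholder column was added earlier, it can simply be dropped again.
Test$is_promoted <- NA
Test$is_promoted <- NULL

# Fit on Train, where is_promoted is known.
rf_m3 <- randomForest(is_promoted ~ ., data = Train,
                      ntree = 150, nodesize = 5, mtry = 2)

# Predict on Test; only the predictor columns are needed in newdata.
rf_test_pred <- predict(rf_m3, newdata = Test, type = "response")

table(rf_test_pred)
```

The predictions can then be attached to the test set with Test$is_promoted <- rf_test_pred if a filled-in column is needed for submission.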