Random forest package in R shows error during prediction() if there are new factor levels present in test data. Is there any way to avoid this error?

Question

My training data has a predictor with 30 factor levels. The same predictor in my test data also has 30 levels, but some of them differ from the training levels. randomForest will not predict unless the levels match exactly. It fails with: Error in predict.randomForest(model,test) New factor levels not present in the training data

Exactly how is it supposed to construct a prediction for something it has not 'seen' before? — IRTFM
A common workaround is to group the infrequent levels of the factor into an "other" level, that will also contain unobserved values. Another idea (if there are many variables) would be to discard the trees that use a variable with an unobserved value. — Vincent Zoonekynd
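The "other" level idea from the comment above can be sketched in base R as follows. This is a minimal sketch, not a randomForest feature: `lump_levels`, `min_count` and the label "other" are hypothetical names chosen for illustration, and the threshold of 5 is an arbitrary assumption.

```r
# Hypothetical helper: keep levels that occur at least `min_count` times in
# the training data and map everything else to `other`. Levels that appear
# only in the test data (and any NA values) also end up in `other`.
lump_levels <- function(train_x, test_x, min_count = 5, other = "other") {
  train_x <- as.character(train_x)
  test_x  <- as.character(test_x)

  counts <- table(train_x)                      # frequencies in training data
  keep   <- names(counts)[counts >= min_count]  # levels frequent enough to keep
  lvls   <- unique(c(keep, other))              # shared level set for both sets

  train_x[!(train_x %in% keep)] <- other
  test_x[!(test_x %in% keep)]   <- other

  list(train = factor(train_x, levels = lvls),
       test  = factor(test_x,  levels = lvls))
}

# Toy data: "c" is rare in training, "d" never occurs in training.
train_x <- c(rep("a", 6), rep("b", 5), "c")
test_x  <- c("a", "b", "c", "d")
out <- lump_levels(train_x, test_x)
levels(out$test)
```

Both returned factors share the same levels, so the level check in predict() should pass. If no training value is actually rare, the "other" level has no training rows behind it, so predictions for such test rows are not informed by that predictor.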

Tommy Levi · Accepted Answer · 2013-06-12T16:58:06

One workaround I've found is to first convert the factor variables in both your train and test sets into characters:

test$factor <- as.character(test$factor)
train$factor <- as.character(train$factor)

Then add a column to each with a flag for test/train, i.e.

test$isTest <- rep(1,nrow(test))
train$isTest <- rep(0,nrow(train))

Then rbind them

fullSet <- rbind(test,train)

Then convert back to a factor

fullSet$factor <- as.factor(fullSet$factor)

This will ensure that both the test and train sets have the same levels. Then you can split back off:

test.new <- fullSet[fullSet$isTest==1,]
train.new <- fullSet[fullSet$isTest==0,]

You can then drop the isTest column from each (for example by setting it to NULL), which leaves you with sets that have identical levels to train and test on. Note that rbind needs both data frames to have the same columns. There might be a more elegant solution, but this has worked for me in the past, and you can write it into a little function if you need to repeat it often.
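As a rough sketch of the "little function" mentioned above: the version below reaches the same end state (identical levels in both sets) by building the union of levels directly instead of going through rbind, so the two data frames do not need matching columns. `align_factor_levels` and its arguments are hypothetical names, not part of the randomForest package.

```r
# Hypothetical helper: give `train` and `test` the same levels for every
# factor column named in `cols`.
align_factor_levels <- function(train, test, cols) {
  for (col in cols) {
    # Union of the values seen in either set; sort() also drops NA,
    # which mirrors the level order that as.factor() would produce.
    all_levels <- sort(union(as.character(train[[col]]),
                             as.character(test[[col]])))
    train[[col]] <- factor(as.character(train[[col]]), levels = all_levels)
    test[[col]]  <- factor(as.character(test[[col]]),  levels = all_levels)
  }
  list(train = train, test = test)
}

# Toy data: level "c" appears only in the test set.
train <- data.frame(f = factor(c("a", "b", "a")), y = c(1, 2, 3))
test  <- data.frame(f = factor(c("b", "c")))
aligned <- align_factor_levels(train, test, "f")
levels(aligned$train$f)
levels(aligned$test$f)
```

The model would then be fitted on `aligned$train` and used to predict on `aligned$test`. This only removes the level mismatch: the forest still has no training rows for the levels that appear only in the test set, so predictions for those rows should be treated with caution.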

