
When I create a training set and a test set by splitting a single data frame and build a random forest using the randomForest package, predict() still returns output for factor levels that are not present in the training data. While this gives no error (which is what I was looking for in the related question), my question is: on what basis does the randomForest() model predict the value? Ideally it should have thrown the following error...

Error in predict.randomForest() : New factor levels not present in the training data

Out of curiosity, I want to know whether randomForest() makes some inherent assumption about new factor levels in the test data.

Here's a reproducible example:

seq1 <- c(5, 3, 1, 3, 1, "unwanted_char", 4, 2, 2, 3, 0, 4, 1, 1, 0, 1, 0, 1)
df1 <- matrix(seq1, 6)
df1 <- as.data.frame(df1, stringsAsFactors = TRUE)  # explicit, so columns are factors on R >= 4.0
colnames(df1) <- c("a", "b", "c")
train <- df1[1:4, ]
test <- df1[5:6, ]

Now we create a forest from train and run predict() on test as follows...

library(randomForest)
forest1 <- randomForest(c ~ a + b, data = train, ntree = 500)
test$prediction <- predict(forest1, test, type = "response")

The test data frame contains a prediction of '1' for the last observation, which has a = 'unwanted_char' and b = '4'.

Please note: when you create the test and train data separately, the predict() function throws the above-mentioned error instead of predicting.

3

This is a great question, but I would phrase it as "How does randomForest extrapolate factor variables?" Also, you are treading on some thorny issues with factor handling; I would suggest editing your question to use letters as inputs to make the factor issues clear. Here is a candidate rewrite: gist.github.com/geneorama/6aa6c343506c47b980f0 – geneorama

3 Answers

1 vote

My opinion is that this is a very bad example, but here's the answer:

Your df1 contains only factor variables and just 4 observations. Here mtry will equal 1, meaning that roughly half your trees will be based on b alone and half on a alone. When b == "4" the classification is always 1, i.e. b == "4" perfectly predicts c == 1. Similarly, a == "1" perfectly predicts c == "0".

The reason this works when you create the data in a single data frame is that the variables are factors whose possible levels exist in both train and test, even though the observed counts for some levels are 0 in train. Since "unwanted_char" is a possible level of train$a (although unobserved), it is not problematic for your prediction. If you create these as separate datasets, the factor variables are created independently and test has genuinely new levels.
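You can verify this directly. The sketch below reuses the question's own setup (with stringsAsFactors = TRUE added so the columns are factors on recent R versions); only the levels() and table() calls are new:

```r
seq1 <- c(5, 3, 1, 3, 1, "unwanted_char", 4, 2, 2, 3, 0, 4, 1, 1, 0, 1, 0, 1)
df1 <- as.data.frame(matrix(seq1, 6), stringsAsFactors = TRUE)
colnames(df1) <- c("a", "b", "c")
train <- df1[1:4, ]

levels(train$a)  # still includes "unwanted_char", inherited from df1
table(train$a)   # ...even though its observed count in train is 0
```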

That is to say, your example "works" essentially because of how factors are constructed in R, not because randomForest makes any special assumption about new levels.

0 votes

Error in predict.randomForest() : New factor levels not present in the training data

This error can be confusing. One workaround is to rbind the dataset you want to predict on with the dataset the model was built from, so that both share the same factor levels, and then predict.

After predicting, subset the results back out by row number. It is an easy and tested approach.
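A minimal sketch of that workaround, reusing the question's data. droplevels() is used here only to simulate train and test having been created separately (with mismatched level sets):

```r
library(randomForest)

seq1 <- c(5, 3, 1, 3, 1, "unwanted_char", 4, 2, 2, 3, 0, 4, 1, 1, 0, 1, 0, 1)
df1 <- as.data.frame(matrix(seq1, 6), stringsAsFactors = TRUE)
colnames(df1) <- c("a", "b", "c")

# Simulate separately created datasets: each gets its own level set
train <- droplevels(df1[1:4, ])
test  <- droplevels(df1[5:6, ])

# rbind() unions the factor levels, so both halves share one level set again
combined <- rbind(train, test)

forest <- randomForest(c ~ a + b, data = combined[1:4, ], ntree = 500)

# Predict on the test rows, then subset back out by row number
pred <- predict(forest, combined[5:6, ], type = "response")
```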

0 votes

I concur with Alex that this is not a good example.

Here is the answer to your question:

       str(train)

If you check the structure of your train data, you will see that variable 'a' has all 4 levels, because the levels were assigned when you created the data frame df1.
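To see the effect, compare the level sets before and after dropping unused levels (reusing train from the question; only the levels() and droplevels() calls are new):

```r
levels(train$a)   # all four levels survive the subsetting, "unwanted_char" included

# droplevels() discards the unused levels; after this, predicting on test
# would raise the "New factor levels not present in the training data" error,
# since test$a now has a level that train no longer knows about
train2 <- droplevels(train)
levels(train2$a)  # "unwanted_char" is gone
```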