
When I create a training set and a test set by splitting a single data frame and build a random forest using the randomForest package, predict() still returns output for factor levels that are not present in the training data. While this gives no error (which is what I was looking for in the related question), my question is: on what basis does the randomForest() model predict the value? Ideally it should have thrown the following error...

Error in predict.randomForest() : New factor levels not present in the training data

Out of curiosity, I want to know whether randomForest() makes some inherent assumption about new factor levels in the test data.

Here's a reproducible example:

seq1 <- c(5, 3, 1, 3, 1, "unwanted_char", 4, 2, 2, 3, 0, 4, 1, 1, 0, 1, 0, 1)
df1 <- matrix(seq1, 6)
df1 <- as.data.frame(df1, stringsAsFactors = TRUE)  # explicit, so columns are factors on R >= 4.0
colnames(df1) <- c("a", "b", "c")
train <- df1[1:4, ]
test <- df1[5:6, ]

Now we create a forest from train and run predict() on test as follows...

library(randomForest)
forest1 <- randomForest(c ~ a + b, data = train, ntree = 500)
test$prediction <- predict(forest1, test, type = "response")

The test data frame contains a prediction of '1' for the last observation, which has a = 'unwanted_char' and b = '4'.

Please note: when you create the test and train data separately, the predict() function throws the above-mentioned error instead of predicting.

3

This is a great question, but I would phrase it as "How does randomForest extrapolate factor variables?" Also, you are treading on some thorny issues with factor handling; I would suggest editing your question to use letters as inputs to make the factor issues clear. Here is a candidate rewrite: gist.github.com/geneorama/6aa6c343506c47b980f0 – geneorama

3 Answers

1 vote

My opinion is that this is a very bad example, but here's the answer:

Your df1 contains only factor variables and just 4 observations. Here mtry will equal 1, meaning that roughly half your trees will be based on b alone and half on a alone. When b == "4" the classification is always 1, i.e. b == "4" perfectly predicts c == 1. Similarly, a == "1" perfectly predicts c == "0".

The reason this works when you create the data in a single data frame is that the variables are factors whose possible levels exist in both train and test, even though the observed counts for some levels are 0 in train. Since "unwanted_char" is a possible level of train$a (although unobserved), it is not problematic for your prediction. If you create these as separate datasets, the factor variables are created independently and test has genuinely new levels.
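You can verify this directly. The sketch below reuses the question's own setup (with stringsAsFactors = TRUE added so the columns are factors on recent R versions); only the levels() and table() calls are new:

```r
seq1 <- c(5, 3, 1, 3, 1, "unwanted_char", 4, 2, 2, 3, 0, 4, 1, 1, 0, 1, 0, 1)
df1 <- as.data.frame(matrix(seq1, 6), stringsAsFactors = TRUE)
colnames(df1) <- c("a", "b", "c")
train <- df1[1:4, ]

levels(train$a)  # still includes "unwanted_char", inherited from df1
table(train$a)   # ...even though its observed count in train is 0
```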

That is to say, your example "works" essentially because of how factors are constructed in R, not because randomForest makes any special assumption about new levels.

0 votes

Error in predict.randomForest() : New factor levels not present in the training data

This error can be confusing. One workaround is to rbind the dataset you want to predict on with the dataset the model was built from, so that both share the same factor levels, and then predict.

After predicting, subset the results back out by row number. It is an easy and tested approach.
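A minimal sketch of that workaround, reusing the question's data. droplevels() is used here only to simulate train and test having been created separately (with mismatched level sets):

```r
library(randomForest)

seq1 <- c(5, 3, 1, 3, 1, "unwanted_char", 4, 2, 2, 3, 0, 4, 1, 1, 0, 1, 0, 1)
df1 <- as.data.frame(matrix(seq1, 6), stringsAsFactors = TRUE)
colnames(df1) <- c("a", "b", "c")

# Simulate separately created datasets: each gets its own level set
train <- droplevels(df1[1:4, ])
test  <- droplevels(df1[5:6, ])

# rbind() unions the factor levels, so both halves share one level set again
combined <- rbind(train, test)

forest <- randomForest(c ~ a + b, data = combined[1:4, ], ntree = 500)

# Predict on the test rows, then subset back out by row number
pred <- predict(forest, combined[5:6, ], type = "response")
```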

0 votes

I concur with Alex that this is not a good example.

Here is the answer to your question:

       str(train)

If you check the structure of your train data, you will see that variable 'a' has all 4 levels, because the levels were assigned when you created the data frame df1.
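To see the effect, compare the level sets before and after dropping unused levels (reusing train from the question; only the levels() and droplevels() calls are new):

```r
levels(train$a)   # all four levels survive the subsetting, "unwanted_char" included

# droplevels() discards the unused levels; after this, predicting on test
# would raise the "New factor levels not present in the training data" error,
# since test$a now has a level that train no longer knows about
train2 <- droplevels(train)
levels(train2$a)  # "unwanted_char" is gone
```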