When I create a training set and a test set by splitting a single data frame and build a random forest using the randomForest
package, the predict()
function still returns predictions for factor levels that are not present in the training data. While this produces no error (which is what I was looking for in the related question), my question is: on what basis does the randomForest()
model predict these values? Ideally it should have thrown the following error...
Error in predict.randomForest() :
New factor levels not present in the training data
I want to know, just out of curiosity, whether the randomForest()
method makes some inherent assumption about new factor levels in the test data.
Here's a reproducible example:
seq1 <- c(5,3,1,3,1,"unwanted_char",4,2,2,3,0,4,1,1,0,1,0,1)
df1 <- matrix(seq1, 6)  # 6 x 3 character matrix (one non-numeric entry coerces everything to character)
df1 <- as.data.frame(df1, stringsAsFactors = TRUE)  # explicit: since R 4.0 this is no longer the default
colnames(df1) <- c("a", "b", "c")
train <- df1[1:4, ]
test <- df1[5:6, ]
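A detail worth checking here (this is a general property of R factors, not anything specific to randomForest): subsetting a data frame keeps the factor's full level set, so `train$a` still carries the level "unwanted_char" even though no training row takes that value. A minimal sketch, self-contained and independent of the randomForest package:

```r
# Rebuild the example data frame (stringsAsFactors = TRUE for R >= 4.0).
seq1 <- c(5,3,1,3,1,"unwanted_char",4,2,2,3,0,4,1,1,0,1,0,1)
df1 <- as.data.frame(matrix(seq1, 6), stringsAsFactors = TRUE)
colnames(df1) <- c("a", "b", "c")

train <- df1[1:4, ]

# Subsetting does not drop unused levels: "unwanted_char" is still a
# declared level of train$a, even though no row of train contains it.
print(levels(train$a))

# droplevels() removes levels with zero observations; after this,
# "unwanted_char" would be a genuinely new level at prediction time.
print(levels(droplevels(train)$a))
```

If this is indeed the mechanism, the model never sees "unwanted_char" as new: it is already part of the factor's level set, so the level-consistency check in predict() passes.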
Now we create a forest using train and run predict()
on test as follows...
library(randomForest)
forest1 <- randomForest(c ~ a + b, data = train, ntree = 500)
test$prediction <- predict(forest1, test, type = 'response')
The test data frame contains a prediction of '1' for the last observation, which has a = 'unwanted_char' and b = '4'.
Please note: when the test and train data are created separately, the predict function throws the above-mentioned error instead of predicting.
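The contrast with separately created data is consistent with factors built from separate vectors having disjoint level sets. A small sketch (hypothetical vectors chosen to mirror column `a` of the example; no model fitting needed to see the level mismatch):

```r
# Factors built independently only know the values they actually contain.
train_a <- factor(c("5", "3", "1", "3"))
test_a  <- factor(c("1", "unwanted_char"))

# "unwanted_char" is a level of test_a but not of train_a, which is the
# situation predict.randomForest() rejects with
# "New factor levels not present in the training data".
print(setdiff(levels(test_a), levels(train_a)))
```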
Comment: "…randomForest extrapolate factor variables". Also, you're treading on some thorny issues with factor handling. I would suggest editing your question to use letters as inputs to make the factor issues clear. Here's a candidate rewrite: gist.github.com/geneorama/6aa6c343506c47b980f0 – geneorama