When I try to use a model fitted with randomForest
to classify new data (or even the original training data), I get the following error:
> res.rf5 <- predict(model.rf5, train.rf5)
Error in predict.randomForest(model.rf5, train.rf5) :
New factor levels not present in the training data
What does this error mean? Why does this error occur even when I try to predict the same data I used to train?
A small example that can be used to reproduce the error is below.
train.rf5 <- structure(
  list(A = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 3L),
                     .Label = c("(-0.1,19.9]", "(19.9,40]", "(80.1,100]"),
                     class = c("ordered", "factor")),
       B = structure(c(3L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 4L),
                     .Label = c("1", "2", "4", "5"),
                     class = c("ordered", "factor")),
       C = structure(c(1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L),
                     .Label = c("FALSE", "TRUE"),
                     class = "factor")),
  .Names = c("A", "B", "C"),
  row.names = c(7L, 8L, 10L, 11L, 13L, 15L, 16L, 17L, 18L, 19L),
  class = "data.frame")
#              A B     C
# 7    (19.9,40] 4 FALSE
# 8  (-0.1,19.9] 1 FALSE
# 10 (-0.1,19.9] 1  TRUE
# 11 (-0.1,19.9] 1 FALSE
# 13 (-0.1,19.9] 1 FALSE
# 15 (-0.1,19.9] 1  TRUE
# 16  (80.1,100] 2  TRUE
# 17 (-0.1,19.9] 1 FALSE
# 18 (-0.1,19.9] 1 FALSE
# 19  (80.1,100] 5  TRUE
library(randomForest)
model.rf5 <- randomForest(C ~ ., data = train.rf5)
res.rf5 <- predict(model.rf5, train.rf5) # Causes error
I see some possibly related questions on SO, but I don't think they solve my issue directly:
- dropping factor levels in a subsetted data frame in R
- Random forest package in R shows error during prediction() if there are new factor levels present in test data. Is there any way to avoid this error?
Unlike the first question, I do not have factor levels that are not represented in the data, and unlike the second, the factor levels in my train and test data are identical (I am predicting on the training data itself).
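For reference, here is a minimal sketch of how both claims can be checked with base R alone. The data frame is rebuilt compactly from the same values as in the example (row names omitted), and the names lvA, lvB, unused, all.used, newdata, same.levels and col.class are only illustrative. It checks the data, not the internals of randomForest, so it does not explain the error by itself.

```r
# Rebuild the example data compactly (same values as above, row names omitted).
lvA <- c("(-0.1,19.9]", "(19.9,40]", "(80.1,100]")
lvB <- c("1", "2", "4", "5")
train.rf5 <- data.frame(
  A = factor(lvA[c(2, 1, 1, 1, 1, 1, 3, 1, 1, 3)], levels = lvA, ordered = TRUE),
  B = factor(lvB[c(3, 1, 1, 1, 1, 1, 2, 1, 1, 4)], levels = lvB, ordered = TRUE),
  C = factor(c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE))
)

# Claim 1: no column declares a level that never occurs in the data.
unused <- lapply(train.rf5, function(x) setdiff(levels(x), as.character(x)))
all.used <- all(sapply(unused, length) == 0)
print(all.used)

# Claim 2: the data passed to predict() has the same levels as the training
# data. Here the "new" data is simply the training data itself.
newdata <- train.rf5
same.levels <- identical(lapply(train.rf5, levels), lapply(newdata, levels))
print(same.levels)

# Class of each column: A and B are ordered factors, C is a plain factor.
col.class <- sapply(train.rf5, function(x) class(x)[1])
print(col.class)
```

Both checks pass on this data, so neither unused levels nor a mismatch between train and test levels seems to apply here.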
Edit: Additional information:
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] randomForest_4.6-7
loaded via a namespace (and not attached):
[1] tools_3.0.1