Random Forest in R: New factor levels not present in the training data

Question

OK, so another newbie question related to the Titanic Competition:

I am trying to run a Random Forest prediction against my test data. All my work has been done on combined test and training data.

I have now split the two into testdata and trainingdata.

I have the following code:

trainingdata <- droplevels(data.combined[1:891,])
testdata <- droplevels(data.combined[892:1309,])

fitRF <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp +
                        Parch + Fare + Embarked +
                        new.title + family.size + FamilyID2,
                      data = trainingdata,
                      importance = TRUE,
                      ntree = 2000)

varImpPlot(fitRF)

#All works up to this point


Prediction <- predict(fitRF, testdata)
# The line above generates the error
# Use the test rows' IDs so the column has the same length as the predictions
submit <- data.frame(PassengerID = testdata$PassengerId, Survived = Prediction)
write.csv(submit, file = "14072017_1_RF", row.names = FALSE)

When I run the Prediction line I get the following error:

> Prediction <- predict(fitRF, testdata)
Error in predict.randomForest(fitRF, testdata) : 
  New factor levels not present in the training data

When I run str(testdata) and str(trainingdata), I can see two factors whose levels no longer match:

trainingdata
$ Parch            : Factor w/ 7 levels
$ FamilyID2        : Factor w/ 22 levels

testdata
$ Parch            : Factor w/ 8 levels
$ FamilyID2        : Factor w/ 18 levels

Is it these differences that are causing my error to occur? And if so, how do I resolve this?

Many Thanks

Additional Information: I have removed Parch and FamilyID2 from the RandomForest creation line, and the code now works, so it is definitely those 2 variables that are causing the issue with mismatched levels.

Possible duplicate of Random forest package in R shows error during prediction() if there are new factor levels present in test data. Is there any way to avoid this error? — Prasanna Nandakumar
I did look at that post, and tried to implement the solution but the error was the same. — Jade Reynolds
Before predict() function if you run testdata <- factor(testdata, levels=levels(trainingdata)) you shouldn't have any issue. — 1.618
I ran that command and it destroyed the testdata dataset, I now have 1 column made up of all of the headings from its previous rows — Jade Reynolds

cremorna · Accepted Answer · 2017-07-14T12:14:39

Fellow newbie here; I have been toying around with the Titanic data recently. I don't think it makes sense to have the Parch variable as a factor, so making it numeric in both data sets may solve the problem:

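# Go through as.character() so a factor yields its labels, not its level codes
train$Parch <- as.numeric(as.character(train$Parch))
test$Parch <- as.numeric(as.character(test$Parch))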
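A caveat on the conversion, in case Parch is already a factor (as it is in the question): calling as.numeric() directly on a factor returns the internal level codes rather than the original values. Below is a minimal sketch on a made-up factor; the variable names and values are illustrative only and not taken from the Titanic data.

```r
# Toy factor standing in for Parch (illustrative values, not the real data)
parch <- factor(c("0", "2", "9", "0"))

# Direct conversion gives the positions of the levels "0", "2", "9"
codes <- as.numeric(parch)

# Converting via character gives the labels themselves as numbers
values <- as.numeric(as.character(parch))

print(codes)
print(values)
```

If Parch is stored as an integer, as in my data, both forms give the same result, so the as.character() step is harmless.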

The underlying cause is that the test data has 2 observations with a Parch value of 9, which does not occur in the training data:

> table(train$Parch)

  0   1   2   3   4   5   6 
678 118  80   5   4   5   1 

> table(test$Parch)

  0   1   2   3   4   5   6   9 
324  52  33   3   2   1   1   2 
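The same comparison can be done for every factor column at once, which would also cover FamilyID2 from the question. This is only a sketch: new_levels is a hypothetical helper name, and the two small data frames are made up to stand in for the real train and test sets.

```r
# Made-up stand-ins for the real data frames
train <- data.frame(Parch = factor(c(0, 1, 2, 6)))
test  <- data.frame(Parch = factor(c(0, 1, 9)))

# Hypothetical helper: for each factor column of train, list the levels
# that occur in test but are missing from train
new_levels <- function(train, test) {
  factor_cols <- names(train)[sapply(train, is.factor)]
  factor_cols <- intersect(factor_cols, names(test))
  unseen <- lapply(factor_cols, function(col) {
    setdiff(levels(test[[col]]), levels(train[[col]]))
  })
  names(unseen) <- factor_cols
  unseen[lengths(unseen) > 0]  # keep only the columns with a mismatch
}

unseen <- new_levels(train, test)
print(unseen)
```

Any column that shows up in the result is a candidate for the "New factor levels" error.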

Alternatively, if you need the variable to be a factor, then you could just add another level to it:

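train$Parch <- as.factor(train$Parch) # in my data, Parch is type int
levels(train$Parch) <- c(levels(train$Parch), "9") # add the level that only occurs in test
levels(train$Parch) # now Parch has 8 levels
table(train$Parch) # level 9 is empty
test$Parch <- factor(test$Parch, levels = levels(train$Parch)) # same level set in test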
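To avoid adding levels by hand for each variable, the idea can be generalised so that every shared factor column gets the union of the levels from both data sets. The following is a sketch under assumptions: align_levels is a hypothetical helper, and the data frames and FamilyID2 values are invented for illustration. I have not run it against the full competition data.

```r
# Made-up stand-ins for the real data frames
train <- data.frame(Parch = factor(c(0, 1, 6)),
                    FamilyID2 = factor(c("Small", "Smith_3", "Small")))
test  <- data.frame(Parch = factor(c(0, 9)),
                    FamilyID2 = factor(c("Small", "Jones_4")))

# Hypothetical helper: give each shared factor column the same level set
# (the union of train and test levels) in both data frames
align_levels <- function(train, test) {
  for (col in intersect(names(train), names(test))) {
    if (is.factor(train[[col]]) && is.factor(test[[col]])) {
      all_levels <- union(levels(train[[col]]), levels(test[[col]]))
      train[[col]] <- factor(train[[col]], levels = all_levels)
      test[[col]]  <- factor(test[[col]],  levels = all_levels)
    }
  }
  list(train = train, test = test)
}

aligned <- align_levels(train, test)
print(levels(aligned$train$Parch))
print(levels(aligned$test$FamilyID2))
```

The model has to be refitted on the aligned training data before calling predict(). In the question's setup, where both sets are cut from one combined data frame, leaving out droplevels() on the two subsets should have a similar effect, because subsetting a data frame keeps the full level set of each factor. Bear in mind that the forest has never seen a training row with the extra levels, so predictions for those test rows are not informed by that variable.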
