OK, so another newbie question related to the Titanic competition:
I am trying to run a random forest prediction against my test data. All my work so far has been done on the combined test and training data.
I have now split the two into testdata and trainingdata.
I have the following code:
library(randomForest)
# Split the combined data back into the original train and test rows
trainingdata <- droplevels(data.combined[1:891, ])
testdata <- droplevels(data.combined[892:1309, ])
fitRF <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp
                        + Parch + Fare + Embarked
                        + new.title + family.size + FamilyID2,
                      data = trainingdata,
                      importance = TRUE,
                      ntree = 2000)
varImpPlot(fitRF)
#All works up to this point
Prediction <- predict(fitRF, testdata)
#This line above generates error
# Take the IDs from testdata so the column has the same length as Prediction
submit <- data.frame(PassengerId = testdata$PassengerId, Survived = Prediction)
write.csv(submit, file = "14072017_1_RF", row.names = FALSE)
When I run the Prediction line I get the following error:
> Prediction <- predict(fitRF, testdata)
Error in predict.randomForest(fitRF, testdata) :
New factor levels not present in the training data
When I run str(testdata) and str(trainingdata) I can see 2 factors that no longer match:
trainingdata
$ Parch : Factor w/ 7 levels
testdata
$ Parch : Factor w/ 8 levels
trainingdata
$ FamilyID2 : Factor w/ 22 levels
testdata
$ FamilyID2 : Factor w/ 18 levels
Is it these differences that are causing my error to occur? And if so, how do I resolve this?
Many Thanks
Additional Information: I have removed Parch and FamilyID2 from the randomForest() call, and the code now works, so it is definitely those 2 variables, with their mismatched levels, that are causing the issue.
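To confirm which factors are responsible, one option is to compare the levels column by column instead of reading through str(). Below is a minimal sketch on toy data. The names combined, train and test and all of the values are made-up stand-ins for data.combined, trainingdata and testdata, not the real Titanic rows.

```r
# Toy stand-in for data.combined (hypothetical values)
combined <- data.frame(
  Parch     = factor(c(0, 1, 2, 0, 1, 9)),
  FamilyID2 = factor(c("Small", "Andersson", "Small", "Small", "Small", "Small")),
  Fare      = c(7.25, 71.3, 8.05, 53.1, 8.46, 51.9)
)

# Same split pattern as in the question: droplevels() on each half
train <- droplevels(combined[1:3, ])
test  <- droplevels(combined[4:6, ])

# Factor columns to inspect
factor_cols <- names(train)[sapply(train, is.factor)]

# For each factor, list the levels that occur in test but not in train
unseen <- lapply(factor_cols, function(col) {
  setdiff(levels(test[[col]]), levels(train[[col]]))
})
names(unseen) <- factor_cols

# Keep only the columns that actually have unseen levels
unseen <- unseen[lengths(unseen) > 0]
print(unseen)
```

In this toy example only Parch is reported, because the value 9 appears in the test rows but never in the training rows.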
predict() function if you run testdata <- factor(testdata, levels=levels(trainingdata)) you shouldn't have any issue. – 1.618
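Note that factor() works on a single vector rather than on a whole data frame, so the idea in the comment above would probably need to be applied one column at a time. Below is a minimal sketch of that, using the union of levels so that no test value is turned into NA. The names combined, train and test and the toy values are hypothetical stand-ins for the question's data.

```r
# Toy stand-in for data.combined (hypothetical values)
combined <- data.frame(
  Parch     = factor(c(0, 1, 2, 0, 1, 9)),
  FamilyID2 = factor(c("Small", "Andersson", "Small", "Small", "Small", "Small")),
  Fare      = c(7.25, 71.3, 8.05, 53.1, 8.46, 51.9)
)

# Splitting with droplevels() reproduces the level mismatch
train <- droplevels(combined[1:3, ])
test  <- droplevels(combined[4:6, ])

# Give every factor column the same level set in both data frames
factor_cols <- names(train)[sapply(train, is.factor)]
for (col in factor_cols) {
  all_levels   <- union(levels(train[[col]]), levels(test[[col]]))
  train[[col]] <- factor(train[[col]], levels = all_levels)
  test[[col]]  <- factor(test[[col]],  levels = all_levels)
}

# Both halves now agree on the levels of each factor
str(train)
str(test)
```

With real data the model would need to be refit on the re-levelled training data before calling predict(). A simpler route may be to leave out droplevels() when splitting, since subsetting a data frame keeps each factor's full set of levels.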