randomForest does not work when training set has more different factor levels than test set

Question

When trying to test my trained model on new test data that has fewer factor levels than my training data, predict() returns the following:

Type of predictors in new data do not match that of the training data.

My training data has a variable with 7 factor levels and my test data has that same variable with 6 factor levels (all 6 ARE in the training data).

When I add an observation containing the "missing" 7th factor level, the model runs, so I'm not sure why this happens or what the logic behind it is.

I could understand randomForest choking if the test set had more or different factor levels, but why does it fail when the training set is the one with "more" levels?

Because if the levels don't match exactly, they could be coded differently. Factors associate labels to integers. So "Male" could be 1 in one set and 2 in another if the factors were created differently. This means you could potentially be predicting to something other than what you expected. R just confirms that all the levels are the same to be safe. You don't need to add observations to make them match, you just need to adjust the levels() of the factor. — MrFlick
Thanks for the answer. When I run levels(train$data) and levels(test$data), the numbers line up except the train$data has an extra factor at the end. Does this mean I have to manually drop that level every time? — bmcarterr
All the levels must match. You don't have to drop that level, you just need to add that level to the factor in the test data. You can add levels without adding observations. You can do test$val <- factor(test$val, levels=levels(train$val)) or something like that. You don't exactly have a reproducible example here so it's difficult to be specific — MrFlick
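
The point made in the comments above, that a factor maps labels to integer codes, can be seen in a small base R sketch. The vectors `a` and `b` are made-up illustrations, not data from the question; they hold the same labels but were created with their levels in a different order.

```r
# Two factors with identical labels but a different level order.
a <- factor(c("Male", "Female"), levels = c("Male", "Female"))
b <- factor(c("Male", "Female"), levels = c("Female", "Male"))

# The labels look the same when compared as text...
as.character(a) == as.character(b)  # TRUE TRUE

# ...but the underlying integer codes are different.
as.integer(a)  # 1 2
as.integer(b)  # 2 1
```

A model that works on the integer codes would treat "Male" in `a` and "Male" in `b` as different categories, which is why the levels are checked before predicting.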

MrFlick · Accepted Answer · 2014-07-21T19:57:47

R expects both the training and the test data to have the exact same levels (even if one of the sets has no observations for a given level or levels). In your case, since the test dataset is missing a level that the train has, you can do

test$val <- factor(test$val, levels=levels(train$val))

to make sure it has all the same levels and that they are coded the same way.
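
As a rough illustration, here is the same fix on toy data. The `train` and `test` data frames below are hypothetical stand-ins for the asker's data (7 levels in training, 6 of them in test), and only base R is used, so no randomForest call is shown.

```r
# Hypothetical toy data: train$val has 7 levels, test$val has only 6 of them.
train <- data.frame(val = factor(letters[1:7]))
test  <- data.frame(val = factor(c("a", "b", "c", "d", "e", "f")))

identical(levels(train$val), levels(test$val))  # FALSE

# Rebuild the test factor using the training levels, in the same order.
test$val <- factor(test$val, levels = levels(train$val))

identical(levels(train$val), levels(test$val))  # TRUE
nlevels(test$val)       # 7
table(test$val)[["g"]]  # 0: the level exists but has no observations
anyNA(test$val)         # FALSE: no existing value was lost
```

One caveat: any value in the test data that is not among the training levels becomes `NA` after this call, so it is worth checking `anyNA()` afterwards.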

(reposted here to close out the question)
