R - Random Forest - Delete New factor levels not present in the training data

Question

I'm debugging some code that uses the randomForest package, and I have almost no previous R experience.

I've reached a point where, when executing predict.randomForest, I get the error:

New factor levels not present in the training data.

Searching this site I've found the reason and understood that I need to delete the records that are causing the problem.

How can I isolate (find out) which columns/rows are causing the problems?

Start by checking which columns in the matrix of predictors are factors. You can run str(X), where X is the matrix of predictors in your training data. Then do the same in your test data, and look in the output to see which one(s) have different numbers or sets of levels. — ulfelder
Thanks! The RF object has a lot of things on it... which one is the matrix of predictors you are referring to? And how do I check if each column is a factor? — DaroK
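
The check suggested in the comments can be sketched as follows. This is only an illustration: the data frames train.data and test.data and their columns (colour, size) are made-up stand-ins for whatever data was passed to randomForest() and predict().

```r
# Hypothetical example data; replace with your own training and test data frames.
train.data <- data.frame(
  colour = factor(c("red", "blue", "red")),
  size   = c(1.2, 3.4, 5.6)
)
test.data <- data.frame(
  colour = factor(c("red", "green")),
  size   = c(2.1, 4.3)
)

# Names of the columns that are factors in the training data
factor.cols <- names(train.data)[sapply(train.data, is.factor)]

# For each factor column, list the test levels that are absent from training
new.levels <- lapply(factor.cols, function(col) {
  setdiff(levels(test.data[[col]]), levels(train.data[[col]]))
})
names(new.levels) <- factor.cols

# Columns with a non-empty entry are the ones causing the error
new.levels
```

Running str(train.data) and str(test.data) shows the same information in a more verbose form.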

Tchotchke · Accepted Answer · 2015-08-13T14:30:20

Assume you have train.data, which you used to build your model; test.data, which you now want predictions for; and a factor variable factor.var1. Then you could do:

levels(test.data$factor.var1) %in% levels(train.data$factor.var1)

This produces a logical vector with one entry per factor level in test.data (not one per row). The FALSE entries mark the levels that were not present in train.data.
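
To go from offending levels to offending rows, a similar comparison can be made on the values themselves. The following is a minimal sketch, assuming the same train.data, test.data and factor.var1 names as above; the example values and the extra column x are invented for illustration.

```r
# Hypothetical example data standing in for the real training and test sets
train.data <- data.frame(factor.var1 = factor(c("a", "b", "a", "c")))
test.data  <- data.frame(factor.var1 = factor(c("a", "d", "c", "e")), x = 1:4)

# TRUE for test rows whose value was seen during training
seen <- as.character(test.data$factor.var1) %in% levels(train.data$factor.var1)

# Row numbers of the records that carry an unseen level
which(!seen)

# Keep only the rows with known levels
test.clean <- test.data[seen, , drop = FALSE]

# Rebuild the factor with the training levels, so that the level set
# in the cleaned test data matches the one used to fit the model
test.clean$factor.var1 <- factor(test.clean$factor.var1,
                                 levels = levels(train.data$factor.var1))
```

The last step matters because subsetting a factor keeps its old levels: without it, the unseen levels would still be listed in levels(test.clean$factor.var1) even though no row uses them.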
