1
votes

I'm debugging some code that uses the randomForest package, with barely any previous R experience.

I've reached a point where, executing predict.randomForest, I get the error:

New factor levels not present in the training data.

Searching this site I've found the reason and understood that I need to delete the records that are causing the problem.

How can I isolate (find out) which columns/rows are causing the problems?

2
Start by checking which columns in the matrix of predictors are factors. You can run str(X), where X is the matrix of predictors in your training data. Then do the same in your test data, and look in the output to see which one(s) have different numbers or sets of levels. – ulfelder
Thanks! The RF object has a lot of things in it... which one is the matrix of predictors you are referring to? And how do I check if each column is a factor? – DaroK

2 Answers

4
votes

Assume you have train.data, which you used to build your model, test.data, which you now want to get predictions for, and your factor variable factor.var1, then you could do:

levels(test.data$factor.var1) %in% levels(train.data$factor.var1)

This produces a logical vector with one entry per factor level in test.data; the FALSE entries mark the levels that were not present in train.data.
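To run the same check over every factor column at once, and to locate the offending rows as well, something along these lines might work. It is a minimal sketch in base R only: the toy data frames stand in for your real train.data and test.data, and the names factor.cols, new.levels and bad.rows are made up for illustration.

```r
# Hypothetical toy data standing in for the real train.data and test.data
train.data <- data.frame(
  factor.var1 = factor(c("a", "b", "a", "c")),
  num.var1    = c(1.0, 2.5, 3.1, 4.7)
)
test.data <- data.frame(
  factor.var1 = factor(c("a", "d", "b", "e")),
  num.var1    = c(0.9, 2.2, 3.3, 4.1)
)

# Names of the columns that are factors in the training data
factor.cols <- names(train.data)[sapply(train.data, is.factor)]

# For each factor column, the values that occur in test.data
# but are not among the training levels (NAs are ignored)
new.levels <- lapply(factor.cols, function(col) {
  vals <- unique(as.character(test.data[[col]]))
  setdiff(vals[!is.na(vals)], levels(train.data[[col]]))
})
names(new.levels) <- factor.cols
new.levels <- new.levels[lengths(new.levels) > 0]  # keep only offending columns

# TRUE for every row of test.data that carries at least one unseen level
bad.rows <- Reduce(`|`, lapply(names(new.levels), function(col) {
  as.character(test.data[[col]]) %in% new.levels[[col]]
}), rep(FALSE, nrow(test.data)))

print(new.levels)       # offending columns with their unseen levels
print(which(bad.rows))  # row numbers of the offending records
```

If you then want to drop those records, test.data[!bad.rows, ] keeps only the rows whose factor values were all seen in training. Note that subsetting does not remove the unused levels from the factor itself, so you may also need droplevels() on the result.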

0
votes

A simple solution would be to rbind the test data with the training data, predict, and then subset the rows you want predictions for. This worked for me.
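The rbind-and-subset step could look roughly like the sketch below. It uses base R only and made-up toy data, and the names combined, is.test and test.aligned are just for illustration. It assumes that the test values all occur in the training data and only the level sets of the two factors differ. The call to predict is left as a comment because it depends on your fitted model.

```r
# Hypothetical toy data: the test factor has fewer levels than the training factor
train.data <- data.frame(
  factor.var1 = factor(c("a", "b", "a", "c")),
  num.var1    = c(1.0, 2.5, 3.1, 4.7)
)
test.data <- data.frame(
  factor.var1 = factor(c("a", "b")),
  num.var1    = c(0.9, 2.2)
)

# Stack training and test rows; rbind merges the factor levels of both
combined <- rbind(train.data, test.data)

# The test rows are the ones appended after the training rows
is.test <- seq_len(nrow(combined)) > nrow(train.data)
test.aligned <- combined[is.test, , drop = FALSE]

# The subset keeps the merged level set, so it now matches the training levels
print(levels(test.data$factor.var1))
print(levels(test.aligned$factor.var1))

# Assumed next step, with your own model object:
# predictions <- predict(model, newdata = test.aligned)
```

If the test data contains values that never appeared in training, the merged factor will still include those extra levels, so this only helps when the mismatch is in the level sets rather than in the values themselves.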