I've built an MLR model in R that includes a categorical variable with 93 levels. I tried grouping some levels and removing the predictor altogether, but both made the model worse, so I've had to leave it in. The model seems to be working fine, so I want to create a predicted vs observed plot. However, when I run the predict function on my model, I get this error:

"Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor C has new levels xxxx, yyyy"

Has anyone had this error before? I'm not sure how to fix it, and it only comes up when I try to predict.

Here's the code I used:

lm12 <- lm(log(A) ~ B + C + log(D) + E + F + log(G) + log(H), data = mydata)
pred <- predict(lm12, mydata)

(B and C are categorical, the rest are continuous.)

Thank you!

It's easier to help you if you include a simple reproducible example with sample input that can be used to test and verify possible solutions. The error message indicates that your prediction set contains values for one of your variables (C) that did not appear in your training set. You can't predict for levels of a categorical variable that a simple linear model has never seen. - MrFlick
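To make the comment above concrete, here is a minimal sketch of how to find the levels the model has not seen. The data frames `train` and `test` and all values in them are made up for illustration; only the `xlevels` component comes from the error message in the question.

```r
# Toy data standing in for the asker's data (all values are hypothetical).
train <- data.frame(A = c(1.2, 2.5, 3.1, 4.8, 5.0, 6.3),
                    C = factor(c("x", "x", "y", "y", "z", "z")))
test  <- data.frame(C = factor(c("x", "y", "w")))

fit <- lm(log(A) ~ C, data = train)

# The factor levels used in the fit are stored in fit$xlevels.
# Anything in the new data that is not in that list triggers the error.
unseen <- setdiff(unique(as.character(test$C)), fit$xlevels$C)
unseen

# One option: predict only for the rows whose level the model knows.
known <- test$C %in% fit$xlevels$C
pred  <- predict(fit, newdata = test[known, , drop = FALSE])
```

Dropping the unknown rows is only one possible choice; whether it is acceptable depends on what the predictions are for.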
Try setting the levels on the test data to be the same as on the training data: levels(testdata$column) <- levels(traindata$column) - bgaerber
This is a recurring issue, and I am not sure there is a single solution. You could, for example, remove the variable(s) concerned (your model doesn't know what to do with newcomers) or try to find a similar profile (based on row comparison) so that newcomers can be put into a pre-existing class. - cbo
@bgaerber That's a very dangerous code recommendation. You will completely change the meaning of the levels if they are not already the same and in exactly the same order. For example, if you had testdata <- data.frame(column = c("Male", "Female")); traindata <- data.frame(column = c("Man", "Woman")) you would be swapping the meaning of the values. - MrFlick
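The swap described in this comment can be reproduced with a short sketch. The vectors `train_col`, `test_col` and `new_col` are hypothetical stand-ins, and `factor()` is called explicitly because recent R versions no longer convert strings to factors inside `data.frame()`.

```r
train_col <- factor(c("Man", "Woman"))    # levels: "Man", "Woman"
test_col  <- factor(c("Male", "Female"))  # levels: "Female", "Male" (alphabetical)

# Risky: `levels<-` renames levels by position, not by label,
# so "Female" becomes "Man" and "Male" becomes "Woman".
relabelled <- test_col
levels(relabelled) <- levels(train_col)
as.character(relabelled)

# Safer: rebuild the factor by label. Matching labels keep their meaning,
# and labels that were not in the training data become NA.
new_col <- c("Woman", "Other")
aligned <- factor(new_col, levels = levels(train_col))
is.na(aligned)
```

With the second approach, unknown labels show up as NA instead of being silently renamed, so the mismatch is at least visible.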
Thank you everyone for your advice. I wasn't sure how to create a reproducible example since the data is so complex. I realised that some of the levels in the training set may not appear in the test set and vice versa, so I have used all the data for training, and the error still appears? - Emily Drew
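One possible explanation for still getting the error when predicting on the training data, which the thread does not confirm: `lm()` drops rows with missing values and also drops factor levels that only occur in those rows, while `predict()` keeps every row of the data it is given. The sketch below reproduces this with made-up data; `d` and its values are hypothetical.

```r
# Hypothetical data: level "z" occurs only in a row where A is missing.
d <- data.frame(A = c(1.2, 2.5, 3.1, 4.8, NA),
                C = factor(c("x", "x", "y", "y", "z")))

fit <- lm(log(A) ~ C, data = d)

# "z" is absent because its only row was removed before fitting.
fit$xlevels$C

# Predicting on the full data frame therefore fails with
# "factor C has new level z".
res <- try(predict(fit, d), silent = TRUE)

# Predicting on the rows that were actually used in the fit works.
used <- d[complete.cases(d), ]
pred <- predict(fit, used)

# Observed response values as used in the fit (here log(A)).
observed <- model.response(model.frame(fit))
```

If this is the cause, `fitted(fit)` returns the same predictions as `pred`, and plotting them against `observed` gives a predicted vs observed plot on the log scale.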