1
votes

I am getting an error on my dataset, using logic similar to the code posted below. I tried increasing the amount of training data, but that didn't solve it. I have already excluded all NA values.

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor y has new levels L, X

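set.seed(234)
# stringsAsFactors = TRUE keeps y a factor (the default changed in R 4.0)
d <- data.frame(w = abs(rnorm(50) * 1000),
                x = rnorm(50),
                y = sample(LETTERS[1:26], 50, replace = TRUE),
                stringsAsFactors = TRUE)

# 80/20 train/test split
train_idx <- sample(1:nrow(d), floor(0.8 * nrow(d)))
train <- d[train_idx, ]
test  <- d[-train_idx, ]

fit <- lm(w ~ x + y, data = train)
predict(fit, test)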

3 Answers

3
votes

Since @jdobres has already explained why this error occurs, I'll jump straight to a solution.

Try adding the line below just before the predict statement:

#add all levels of 'y' in 'test' dataset to fit$xlevels[["y"]] in the fit object
fit$xlevels[["y"]] <- union(fit$xlevels[["y"]], levels(test[["y"]]))

Hope this would resolve your problem!
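One thing to keep in mind: extending fit$xlevels stops the error, but the fitted model still has no coefficient for the added levels. A possible alternative, sketched below on made-up toy data (the data frames and the name `unseen` are assumptions for illustration, not from the question), is to turn unseen labels into NA before predicting, so those rows come back as NA instead of a number the model cannot really support.

```r
# Toy data: level "C" appears only in the test rows
train <- data.frame(w = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
                    x = c(0.1, 0.4, 0.2, 0.9, 0.7, 0.5),
                    y = factor(c("A", "A", "B", "B", "A", "B")))
test  <- data.frame(x = c(0.3, 0.6, 0.8),
                    y = factor(c("A", "B", "C")))

fit <- lm(w ~ x + y, data = train)

# Flag rows whose label was never seen during training
unseen <- !(as.character(test$y) %in% fit$xlevels[["y"]])

# Rebuild the factor with the training levels only; unseen labels become NA
test$y <- factor(as.character(test$y), levels = fit$xlevels[["y"]])

# Rows with an NA predictor get an NA prediction
pred <- predict(fit, test)
pred
```

You can then decide separately what to do with the rows flagged in `unseen` (drop them, report them, or refit with more data).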

2
votes

Factor and character data are treated as categorical variables. As such, models cannot form predictions for category labels they've never seen before. If you built a model to predict things about "poodle" and "pit bull", the model would fail if you gave it "golden retriever".

More specific to your example, the error is telling you that labels "L" and "X", which are in your test set, do not appear in your training set. Since they weren't in the training set, the model doesn't know what to do when it encounters them in the test set.
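To check which labels are causing this, you can compare the two sets directly. A minimal sketch on made-up toy data (the data frames and the name `unseen` are just for illustration):

```r
# Toy data: label "C" occurs only in the test rows
train <- data.frame(w = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
                    y = c("A", "A", "B", "B", "A", "B"))
test  <- data.frame(w = c(3.0, 4.0, 5.0),
                    y = c("A", "B", "C"))

# Labels present in the test set but absent from the training set
unseen <- setdiff(unique(as.character(test$y)),
                  unique(as.character(train$y)))
unseen

# Number of test rows affected
n_affected <- sum(test$y %in% unseen)
n_affected
```

The as.character() calls make the comparison work whether y is stored as a factor or as a character column.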

0
votes

Thanks Prem, and if you have many variables you can loop the line of code like this:

for (k in vars) {
  if (is.factor(shop_data[, k])) {
    ols_fit$xlevels[[k]] <- union(ols_fit$xlevels[[k]], levels(shop_data[[k]]))
  }
}

Here, vars holds the names of the variables used in the model, and shop_data is the main dataset, which is split into train and test.
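For anyone who wants to try the loop end to end, here is a small self-contained sketch. The columns of shop_data, the split, and the model formula are all invented for illustration; only the loop itself follows the code above.

```r
# Hypothetical data: "E" (region) and "c" (tier) occur only outside the training rows
shop_data <- data.frame(
  sales  = c(10, 12, 9, 15, 11, 14, 13, 8),
  region = factor(c("N", "S", "N", "S", "N", "S", "E", "N")),
  tier   = factor(c("a", "b", "a", "b", "b", "a", "a", "c")),
  size   = c(1.2, 2.3, 1.1, 3.0, 2.2, 2.8, 2.5, 0.9)
)
train <- shop_data[1:6, ]

ols_fit <- lm(sales ~ region + tier + size, data = train)

# Names of the predictors used in the model
vars <- c("region", "tier", "size")

# Add every level found in the full dataset to the levels stored in the fit
for (k in vars) {
  if (is.factor(shop_data[, k])) {
    ols_fit$xlevels[[k]] <- union(ols_fit$xlevels[[k]], levels(shop_data[[k]]))
  }
}

ols_fit$xlevels
```

Note that the loop only widens the set of levels the fit will accept; the model still has no coefficients for "E" or "c", so predictions for those rows should be treated with caution.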