33
votes

Consider a simple dataset, split into a training and testing set:

# stringsAsFactors=TRUE makes y a factor, as was the default before R 4.0.0
dat <- data.frame(x=1:5, y=c("a", "b", "c", "d", "e"), z=c(0, 0, 1, 0, 1), stringsAsFactors=TRUE)
train <- dat[1:4,]
train
#   x y z
# 1 1 a 0
# 2 2 b 0
# 3 3 c 1
# 4 4 d 0
test <- dat[5,]
test
#   x y z
# 5 5 e 1

When I train a logistic regression model to predict z using x and obtain test-set predictions, all is well:

mod <- glm(z~x, data=train, family="binomial")
predict(mod, newdata=test, type="response")
#         5 
# 0.5546394 

However, this fails on an equivalent-looking logistic regression model with a "Factor has new levels" error:

mod2 <- glm(z~.-y, data=train, family="binomial")
predict(mod2, newdata=test, type="response")
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
#   factor y has new level e

Since I removed y from my model equation, I'm surprised to see this error message. In my application, dat is very wide, so z~.-y is the most convenient model specification. The simplest workaround I can think of is removing the y variable from my data frame and then training the model with the z~. syntax, but I was hoping for a way to use the original dataset without the need to remove columns.

2 Answers

43
votes

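You could try updating mod2$xlevels[["y"]] in the model object. Although y is dropped from the model terms, glm still records its training levels in xlevels, and predict checks the new data against them: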

mod2 <- glm(z~.-y, data=train, family="binomial")
mod2$xlevels[["y"]] <- union(mod2$xlevels[["y"]], levels(test$y))

predict(mod2, newdata=test, type="response")
#        5 
#0.5546394 
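To see where the error comes from, it can help to inspect the fitted object. The following is a minimal sketch that rebuilds the question's data (with stringsAsFactors=TRUE assumed so that y is a factor) and looks at the terms and stored levels; the comments describe what I would expect to see, not verified console output.

```r
# Rebuild the question's data; stringsAsFactors=TRUE is an assumption
# so that y is a factor on R >= 4.0.0 as well.
dat <- data.frame(x = 1:5, y = c("a", "b", "c", "d", "e"),
                  z = c(0, 0, 1, 0, 1), stringsAsFactors = TRUE)
train <- dat[1:4, ]

mod2 <- glm(z ~ . - y, data = train, family = "binomial")

# y is removed from the model terms ...
attr(terms(mod2), "term.labels")

# ... but it is still listed among the variables of the expanded formula,
attr(terms(mod2), "variables")

# and its training levels are stored, which is what predict() checks against.
mod2$xlevels
```

Because unused levels are dropped when the model frame is built, the stored levels for y contain only the values seen in train, so "e" counts as new.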

Another option would be to exclude (but not remove) "y" from the training data passed to glm:

mod2 <- glm(z~., data=train[,!colnames(train) %in% c("y")], family="binomial")
predict(mod2, newdata=test, type="response")
#        5 
#0.5546394 
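
A third possibility, sketched here as an alternative rather than as part of the approaches above, is to build the formula from the column names so that y never enters the model terms at all. The names predictors, f and mod3 are just illustrative, and stringsAsFactors=TRUE is assumed so that y is a factor.

```r
# Rebuild the question's data (stringsAsFactors=TRUE is an assumption).
dat <- data.frame(x = 1:5, y = c("a", "b", "c", "d", "e"),
                  z = c(0, 0, 1, 0, 1), stringsAsFactors = TRUE)
train <- dat[1:4, ]
test <- dat[5, ]

# Every column except the response z and the unwanted y becomes a predictor.
predictors <- setdiff(names(train), c("z", "y"))

# reformulate() builds the formula from the names; here that is z ~ x.
f <- reformulate(predictors, response = "z")

# Train on the full data frame; y stays in train and test but is not used.
mod3 <- glm(f, data = train, family = "binomial")
pred <- predict(mod3, newdata = test, type = "response")
pred
```

This keeps the original data frames untouched, and for a wide dataset the setdiff call scales to any number of excluded columns.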

1
vote

I was confused by this issue for a long time, but the solution turned out to be simple. One of my variables, "traffic type", had 20 levels, and one of them (17) appeared in only a single row. That row could therefore end up in either the training data or the test data. In my case it landed in the test data, hence the error: factor "traffic type" has a new level 17, because no row with level 17 exists in the training data. I deleted this row from the data set, and the model now runs fine.
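
A check like the following can find such levels before calling predict. It is only a sketch on made-up data: the column names traffic_type and clicks and all values are hypothetical stand-ins for my real dataset.

```r
# Hypothetical data: level "17" of traffic_type occurs only in the test rows.
train <- data.frame(traffic_type = factor(c("1", "2", "1", "2")),
                    clicks = c(0, 1, 0, 1))
test <- data.frame(traffic_type = factor(c("2", "17")),
                   clicks = c(1, 0))

# Levels that actually occur in test but never occur in train.
unseen <- setdiff(as.character(unique(test$traffic_type)),
                  as.character(unique(train$traffic_type)))
unseen

# Keep only the test rows whose level was seen during training.
test_seen <- test[!as.character(test$traffic_type) %in% unseen, , drop = FALSE]
test_seen
```

Dropping the offending rows is what I did; whether that is acceptable depends on how much data the rare levels account for.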