
I trained a penalized regression model using R's glmnet package, with X constructed via sparse.model.matrix() and the formula ~ . * var1, so that I get every term from my data plus its interaction with var1:

X3 <- sparse.model.matrix(object = ~.*(var1), data = X)[,-1]

cv_lasso  <- cv.glmnet(x = X3, y = Y3, 
                       alpha = 1,
                       nfold = 10,
                       family = "binomial",
                       nlambda = 100,
                       lambda.min.ratio=0.001,
                       type.measure="auc",
                       keep = TRUE,
                       parallel = TRUE)

Now I'm trying to predict on a couple of data points, converting the new data to a model matrix the same way before passing it to predict(), like below:

X_pred <- sparse.model.matrix(object = ~.*(var1), data = X_holdout)
predict(object =  cv_lasso,
        newx = X_pred,
        s = "lambda.min")

I get the following error:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels

I believe this is caused by a couple of columns in X_holdout being essentially constant (which is expected: these are just a few prediction rows, and training itself already succeeded).

How can I avoid this problem? My understanding is that, since I trained my model with interactions, I have to create a model matrix with the same interactions for my predictions.
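The error can be reproduced in isolation: R's contrast machinery refuses to expand a factor that has fewer than two levels into dummy columns. A minimal example with a made-up single-level factor (unrelated to my actual data):

```r
# Minimal reproduction: a factor with only one level cannot be
# expanded into dummy/contrast columns, so model.matrix() stops.
d <- data.frame(f = factor("a"))   # one level only
tryCatch(model.matrix(~ f, data = d),
         error = function(e) conditionMessage(e))
# error message: contrasts can be applied only to factors with 2 or more levels
```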


1 Answer


Found the root of my problem: some columns of the prediction matrix were constant, since the holdout set is much smaller than the training data.

To fix this, I needed to pass the "xlev" argument when creating the sparse model matrices for both the training data and the prediction data, using the same xlev list for both.

In case you don't know what "xlev" is, it's a named list of character vectors giving the full set of levels to use when expanding each factor variable into dummy/one-hot columns. This way, even if a column contains only one value, sparse.model.matrix() understands that there are more levels; they just don't happen to appear in the data. It also ensures that the training and prediction matrices have the same columns in the same order, which predict() requires.
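A sketch of the fix on toy data (the columns grp, var1, and num are made up for illustration): record the factor levels from the training data, then pass that list as xlev to both sparse.model.matrix() calls, so the holdout matrix gets identical columns even where a factor is constant.

```r
library(Matrix)

# Toy training data: two factors and one numeric column.
train <- data.frame(
  grp  = factor(c("a", "b", "c", "a", "b", "c")),
  var1 = factor(c("x", "y", "x", "y", "x", "y")),
  num  = c(1.2, 3.4, 5.6, 7.8, 9.0, 2.1)
)

# Record the factor levels seen at training time.
xlev <- lapply(Filter(is.factor, train), levels)

# Holdout where 'grp' happens to be constant -- without xlev,
# this single-level factor would trigger the contrasts error.
holdout <- data.frame(
  grp  = factor(c("a", "a")),
  var1 = factor(c("x", "y")),
  num  = c(4.5, 6.7)
)

X_train <- sparse.model.matrix(~ . * var1, data = train,   xlev = xlev)[, -1]
X_pred  <- sparse.model.matrix(~ . * var1, data = holdout, xlev = xlev)[, -1]

# Same columns in the same order, as predict() requires.
identical(colnames(X_train), colnames(X_pred))
```

The absent levels simply become all-zero dummy columns in the holdout matrix, which is exactly what the fitted coefficients expect.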