Lasso Regression glmnet - error regarding the input data

Question

I am trying to fit a Lasso regression model using glmnet(). As I have never worked with Lasso regression before, I followed some tutorials, but when I apply the model, it always fails with the following error:

Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,: 
one multinomial or binomial class has 1 or 0 observations; not allowed

Working with the dataset from this question (https://stats.stackexchange.com/questions/72251/an-example-lasso-regression-using-glmnet-for-binary-outcome), it seems that the dependent variable y has to consist only of 0 and 1. Whenever I set one of the observation values of y to 2 or anything other than 0 or 1, I get this error.

This is my code:

lambdas_to_try <- 10^seq(-3, 5, length.out = 100)

x_vars <- as.matrix(data.frame(data$x1, data$x2, data$x3))
lasso_cv <- cv.glmnet(x_vars, y=as.factor(data$y), alpha = 1, lambda = lambdas_to_try, family = "binomial", nfolds = 10)

x_vars_2 <- model.matrix(data$y ~ data$x1 + data$x2 + data$x3)[, -1]
lasso_cv_2 <- cv.glmnet(x_vars, y=as.factor(data$y), alpha = 1, lambda = lambdas_to_try, family = "binomial", nfolds = 10)

And this is what my dataset looks like:

The problem is that in my data the y variable represents the number of crimes, so it takes integer values between 0 and 1000. I cannot restrict it to 0 and 1. How can I apply a Lasso regression to data like this?

Please show the command you are using, not just the error. But it sounds like you are using logistic regression (GLM with binomial error structure), which requires binary data. If your data is not binary, don't do that. Perhaps regular linear regression, or perhaps Poisson regression, since your data sounds like count data. — Gregor Thomas
If you show the code you are using, we can help you with how to change it. — Gregor Thomas
I added the code and a picture showing what my data looks like. But I need to fit a Lasso regression; I do not have binary data, my data are count data. — the_chimp
When I change the family to poisson, I get a different error, although I do not have any negative values (I double-checked my data). Any idea? Error in if (any(y < 0)) stop("negative responses encountered; not permitted for Poisson family") : missing value where TRUE/FALSE needed. Warning: In Ops.factor(y, 0) : ‘<’ not meaningful for factors — the_chimp

StupidWolf · Accepted Answer · 2020-10-29T15:52:05

As @Gregor noted, what you have is count data, and it should be regression and not classification. Using an example dataset, this is how you can implement it:

library(MASS)
library(glmnet)
data(Insurance)

Your response variable should be numeric:

str(Insurance)
'data.frame':   64 obs. of  5 variables:
 $ District: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ Group   : Ord.factor w/ 4 levels "<1l"<"1-1.5l"<..: 1 1 1 1 2 2 2 2 3 3 ...
 $ Age     : Ord.factor w/ 4 levels "<25"<"25-29"<..: 1 2 3 4 1 2 3 4 1 2 ...
 $ Holders : int  197 264 246 1680 284 536 696 3582 133 286 ...
 $ Claims  : int  38 35 20 156 63 84 89 400 19 52 ...
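
Applied to the code in the question, this would most likely mean passing y as a plain numeric vector instead of as.factor(data$y), since the Ops.factor warning in the comments points to the factor conversion. The sketch below uses a simulated data frame as a stand-in for the asker's data; the column names x1, x2, x3 and y come from the question, but the values and the simulation itself are assumptions for illustration only.

```r
library(glmnet)

set.seed(1)

# Hypothetical stand-in for the asker's data: three numeric predictors
# and a non-negative integer count as the response.
n <- 200
data <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
data$y <- rpois(n, lambda = exp(1 + 0.5 * data$x1 - 0.3 * data$x2))

x_vars <- as.matrix(data[, c("x1", "x2", "x3")])

# Keep the response numeric. Do not wrap it in as.factor():
# the Poisson family checks any(y < 0), which is undefined for factors.
y_num <- data$y

lasso_cv <- cv.glmnet(x_vars, y = y_num, alpha = 1,
                      family = "poisson", nfolds = 10)

# Coefficients at the lambda with the lowest cross-validated deviance
coef(lasso_cv, s = "lambda.min")
```

If y is already stored as a factor in the data frame, as.numeric(as.character(data$y)) recovers the original counts, whereas as.numeric(data$y) alone would return the internal level codes.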

Now we set the predictors and response variables:

y = Insurance$Claims
# Drop the intercept column; glmnet fits its own intercept
X = model.matrix(Claims ~ ., data=Insurance)[, -1]

Run cross-validation to find the best lambda (useful when you don't already know how strong the L1 penalty should be):

fit = cv.glmnet(x=X,y=y,family="poisson")
pred = predict(fit,X,s=fit$lambda.1se)
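
To see what the cross-validation selected, it can help to look at the two candidate lambdas and at which coefficients survive the penalty. This is a minimal sketch that refits the model on the same Insurance data; the seed is an assumption added only so that the fold assignment is reproducible.

```r
library(MASS)
library(glmnet)
data(Insurance)

y <- Insurance$Claims
X <- model.matrix(Claims ~ ., data = Insurance)[, -1]

set.seed(1)  # folds are assigned at random, so fix the seed
fit <- cv.glmnet(x = X, y = y, family = "poisson")

# lambda.min minimises the cross-validated deviance; lambda.1se is the
# largest lambda whose error is within one standard error of that minimum.
fit$lambda.min
fit$lambda.1se

# Coefficients at lambda.1se; rows that are exactly zero were dropped by the lasso
cf <- as.matrix(coef(fit, s = "lambda.1se"))
nonzero <- cf[cf[, 1] != 0, , drop = FALSE]
nonzero
```

Calling plot(fit) draws the cross-validation curve with both lambdas marked, which is a quick way to judge how sensitive the error is to the choice.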

The prediction is on the log scale (by default, predict() returns the linear predictor), so to compare it with your actual values, take the log of y:

plot(log(y),pred,xlab="log (actual)",ylab="log (predicted)")
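
If predictions on the original count scale are preferred, predict() can return them directly with type = "response", which for the Poisson family is the exponential of the linear predictor. The sketch below refits the same model and checks that equivalence; the seed and the variable names pred_link and pred_count are assumptions for illustration.

```r
library(MASS)
library(glmnet)
data(Insurance)

y <- Insurance$Claims
X <- model.matrix(Claims ~ ., data = Insurance)[, -1]

set.seed(1)
fit <- cv.glmnet(x = X, y = y, family = "poisson")

# Default type is "link": predictions on the log scale
pred_link <- predict(fit, newx = X, s = "lambda.1se")

# type = "response": predicted counts, i.e. exp() of the link-scale values
pred_count <- predict(fit, newx = X, s = "lambda.1se", type = "response")

same <- isTRUE(all.equal(as.numeric(exp(pred_link)), as.numeric(pred_count)))
same

plot(y, pred_count, xlab = "actual", ylab = "predicted")
abline(0, 1, lty = 2)  # reference line for perfect agreement
```

Plotting on the count scale also avoids taking log(y) for observations where the count is zero.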
