14
votes

I'm working on a project that would show the potential influence a group of events have on an outcome. I'm using the glmnet() package, specifically using the Poisson feature. Here's my code:

```r
# de <- data imported from SQL connection
x <- model.matrix(~ ., data = de[, 2:7])
y <- de[, 1]
reg  <- cv.glmnet(x, y, family = "poisson", alpha = 1)
reg1 <- glmnet(x, y, family = "poisson", alpha = 1)

Co <- coef(?reg or reg1?, s = ???)   # <-- this line is what I'm unsure about

summ <- summary(Co)
c <- data.frame(Name   = rownames(Co)[summ$i],
                Lambda = summ$x)
c2 <- c[with(c, order(-Lambda)), ]
```

The beginning imports a large amount of data from my database in SQL. I then put it in matrix format and separate the response from the predictors.

This is where I'm confused: I can't figure out exactly what the difference is between the glmnet() function and the cv.glmnet() function. I realize that the cv.glmnet() function is a k-fold cross-validation of glmnet(), but what exactly does that mean in practical terms? They provide the same value for lambda, but I want to make sure I'm not missing something important about the difference between the two.

I'm also unclear as to why it runs fine when I specify alpha=1 (supposedly the default), but not if I leave it out?

Thanks in advance!

2
Try looking at plot(reg). – Roland
Never rely on glmnet's default lambda sequence! Notorious issue. Always provide your own sequence. Then get the optimal lambda value afterwards from fit$lambda.min and use it with the s=lambda.min parameter in all calls to predict(), coef(), etc. – smci
@smci Why not use lambda.1se? That is exactly the one predict() uses by default. – Alina
Could you please give some details on why not to use the predefined lambda sequence, and how to choose a better one? – pikachu
@smci Could you substantiate your claims about the default lambda sequence being garbage? Apart from my belief that the authors of glmnet knew what they were doing, the sequence goes from a maximum lambda, for which all coefficients are guaranteed to be zero, down to a very small one where usually all coefficients enter the model (depending, of course, on the shape of your matrix), which makes a lot of sense IMO. And in my cases it worked perfectly. Is there some class of models where it does not? – Elmar Zander

2 Answers

20
votes

glmnet() is an R function that fits penalized regression models such as ridge and lasso. The alpha argument determines which penalty is used: alpha=0 fits a ridge model, alpha=1 fits a lasso model, and values in between give the elastic net.

cv.glmnet() performs cross-validation, by default 10-fold, which can be adjusted using nfolds. A 10-fold CV randomly divides your observations into 10 non-overlapping folds of approximately equal size. Each fold is held out in turn as a validation set while the model is fit on the remaining 9 folds. Balancing bias and variance is usually the motivation behind such validation methods. In the case of lasso and ridge models, CV helps choose the value of the tuning parameter lambda.
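A minimal sketch of the practical difference, using simulated data (the variable names and the simulated Poisson setup here are illustrative, not from the question):

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 6), nrow = 100)        # 100 observations, 6 predictors
y <- rpois(100, lambda = exp(0.5 * x[, 1]))    # Poisson response driven by x1

fit   <- glmnet(x, y, family = "poisson", alpha = 1)     # fits a whole path of lambdas
cvfit <- cv.glmnet(x, y, family = "poisson", alpha = 1)  # same path, plus CV error per lambda

length(fit$lambda)   # glmnet alone returns many lambdas but no criterion to pick one
cvfit$lambda.min     # lambda with the smallest cross-validated error
cvfit$lambda.1se     # largest lambda within 1 standard error of that minimum
```

So glmnet() gives you the full coefficient path, while cv.glmnet() additionally tells you which point on that path minimizes the cross-validated error.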

In your example, plot(reg) shows the CV error curve, and reg$lambda.min gives the value of lambda that results in the smallest CV error. You can then compute the test error for that value of lambda. By itself, glmnet() fits the model over an automatically selected range of lambda values, but it does not tell you which of those values minimizes the test error; that is what the cross-validation in cv.glmnet() adds.

Hope this helps!
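To tie this back to the line the question marks as unclear: coefficients are normally extracted from the cv.glmnet object, with s selecting the lambda by name. A sketch, continuing from the question's reg:

```r
# extract coefficients at the lambda chosen by cross-validation;
# reg is the cv.glmnet fit, and "lambda.1se" works here as well
Co <- coef(reg, s = "lambda.min")

summ <- summary(Co)                 # sparse-matrix summary: nonzero entries only
c <- data.frame(Name = rownames(Co)[summ$i],
                Coef = summ$x)
c2 <- c[order(-c$Coef), ]           # sort by coefficient size, largest first
```

Note that summ$x holds the coefficient values, not lambda, so naming that column Lambda (as in the question) is misleading.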

2
votes

Between reg$lambda.min and reg$lambda.1se: lambda.min will give you the lowest cross-validated error, but depending on how much error you can tolerate, you may prefer reg$lambda.1se, since this larger value shrinks more coefficients to zero and so selects fewer predictors. You may also choose the mean of reg$lambda.min and reg$lambda.1se as your lambda value.
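A quick way to see this trade-off, assuming reg is the cv.glmnet fit from the question:

```r
# compare how many predictors each choice of lambda keeps
co_min <- coef(reg, s = "lambda.min")
co_1se <- coef(reg, s = "lambda.1se")

sum(co_min != 0)   # number of nonzero coefficients at lambda.min
sum(co_1se != 0)   # typically fewer, at slightly higher CV error
```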