10
votes

I'm running an elastic net on a generalized linear model with the glmnet and caret packages in R.

My response variable is cost (where cost > $0) and hence I'd like to specify a Gaussian family with a log link for my GLM. However glmnet doesn't seem to allow me to specify (link="log") as follows:

> lasso_fit <- glmnet(x, y, alpha=1, family="gaussian"(link="log"), lambda.min.ratio=.001)

I've tried different variants, with and without quotations, but no luck. The glmnet documentation doesn't discuss how to include a log link.

Am I missing something? Does family="gaussian" already implicitly assume a log link?

2
I think this could be difficult. If you dig into the glmnet code you'll see that glmnet(..., family="gaussian") calls elnet , which calls the Fortran spelnet function. (Poisson regression calls fishnet, which calls fishnet or spfishnet (for dense vs sparse model matrices.) So I suspect someone would have to start from scratch to write a variant of elastic net that handled the log linkage ... - Ben Bolker

2 Answers

5
votes

It is a bit confusing, but the family argument in glmnet and glm are quite different. In glm, you can specify a character like "gaussian", or you can specify a function with some arguments, like gaussian(link="log"). In glmnet, you can only specify the family with a character, like "gaussian", and there is no way to automatically set the link through that argument.

The default link for gaussian is the identity function, that is, no transformation at all. But, remember that a link function is just a transformation of your y variable; you can just specify it yourself:

glmnet(x, log(y), family="gaussian")

Also note that the default link for the poisson family is log, but the objective function will change. See the Details under ?glmnet in the first couple of paragraphs.


Your comments have led me to rethink my answer; I have evidence that it is not correct.

As you point out, there is a difference between E[log(Y)] and log(E[Y]). I think what the above code does is to fit E[log(Y)], which is not what you want. Here is some code to generate data and confirm what you noted in the comments:

# Generate data
set.seed(1)
x <- replicate(3,runif(1000))
y <- exp(2*x[,1] + 3*x[,2] + x[,3] + runif(1000))
df <- data.frame(y,x)

# Run the model you *want*
glm(y~., family=gaussian(link="log"), data=df)$coef
# (Intercept)          X1          X2          X3 
#   0.4977746   2.0449443   3.0812333   0.9451073 

# Run the model you *don't want* (in two ways)    
glm(log(y)~., family=gaussian(link='identity'), data=df)$coef
# (Intercept)          X1          X2          X3 
#   0.4726745   2.0395798   3.0167274   0.9957110 
lm(log(y)~.,data=df)$coef
# (Intercept)          X1          X2          X3 
#   0.4726745   2.0395798   3.0167274   0.9957110 

# Run the glmnet code that I suggested - getting what you *don't want*.
library(glmnet)
glmnet.model <- glmnet(x,log(y),family="gaussian", thresh=1e-8, lambda=0)
c(glmnet.model$a0, glmnet.model$beta[,1])
#        s0        V1        V2        V3 
# 0.4726745 2.0395798 3.0167274 0.9957110 
2
votes

I know that this is an old question, but in the current version of (4.0-2) it is possible to use glm family functions as arguments to "family" instead of a character string, so you could use:

glmnet(x, y, family=gaussian(link="log"))

Note that the package is faster when you use the string arguments.

Reference: https://glmnet.stanford.edu/articles/glmnetFamily.html