cv.glmnet fails for ridge, not lasso, for simulated data with coder error

Question

Gist

The error: Error in predmat[which, seq(nlami)] = preds : replacement has length zero

The context: data is simulated with a binary y, but there are n coders of true y. the data is stacked n times and a model is fitted, trying to get true y.

The error is received for

L2 penalty, but not L1 penalty.
when Y is the coder Y, but not when it is the true Y.
the error is not deterministic, but depends on seed.

UPDATE: the error is for versions after 1.9-8. 1.9-8 does not fail.

Reproduction

base data:

library(glmnet)
rm(list=ls())
set.seed(123)

num_obs=4000
n_coders=2
precision=.8

X <- matrix(rnorm(num_obs*20, sd=1), nrow=num_obs)
prob1 <- plogis(X %*% c(2, -2, 1, -1, rep(0, 16))) # yes many zeros, ignore
y_true <- rbinom(num_obs, 1, prob1)
dat <- data.frame(y_true = y_true, X = X)

create coders

classify <- function(true_y,precision){
  n=length(true_y)
  y_coder <- numeric(n)
  y_coder[which(true_y==1)] <- rbinom(n=length(which(true_y==1)),
                                      size=1,prob=precision)
  y_coder[which(true_y==0)] <- rbinom(n=length(which(true_y==0)),
                                      size=1,prob=(1-precision))
  return(y_coder)
}
y_codings <- sapply(rep(precision,n_coders),classify,true_y = dat$y_true)

stack it all

expanded_data <- do.call(rbind,rep(list(dat),n_coders))
expanded_data$y_codings <- matrix(y_codings, ncol = 1)

reproduce error

Since the error depends on seed, a loop is necessary. only the first loop will fail, the other two will finish.

X <- as.matrix(expanded_data[,grep("X",names(expanded_data))])

for (i in 1:1000) cv.glmnet(x = X,y = expanded_data$y_codings,
                            family="binomial", alpha=0)  # will fail
for (i in 1:1000) cv.glmnet(x = X,y = expanded_data$y_codings,
                            family="binomial", alpha=1)  # will not fail
for (i in 1:1000) cv.glmnet(x = X,y = expanded_data$y_true,
                            family="binomial", alpha=0)  # will not fail

Any thoughts where this is coming from in glmnet and how to avoid it? from my reading of cv.glmnet, this is after the cv routine and is inside cvstuff = do.call(fun, list(outlist, lambda, x, y, weights, offset, foldid, type.measure, grouped, keep)), which I do not understand its role, hence the failure, and how to avoid it.

sessions (Ubuntu and PC)

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] glmnet_2.0-2    foreach_1.4.3   Matrix_1.2-7.1  devtools_1.12.0

loaded via a namespace (and not attached):
 [1] httr_1.2.1       R6_2.2.0         tools_3.3.1      withr_1.0.2      curl_2.1        
 [6] memoise_1.0.0    codetools_0.2-15 grid_3.3.1       iterators_1.0.8  knitr_1.14      
[11] digest_0.6.10    lattice_0.20-34

and

R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] glmnet_2.0-2    foreach_1.4.3   Matrix_1.2-7.1  devtools_1.12.0

loaded via a namespace (and not attached):
 [1] httr_1.2.1       R6_2.2.0         tools_3.3.1      withr_1.0.2      curl_2.1        
 [6] memoise_1.0.0    codetools_0.2-15 grid_3.3.1       iterators_1.0.8  digest_0.6.10   
[11] lattice_0.20-34

This seems rather complicated. Why do you have y_codings when you already have y_true? What's the difference? — Hong Ooi
you do not observe y_true, but have some human coders that are coding y based on x, with some precision. @HongOoi — Elad663
Changing random seed fixed it for me: github.com/lmweber/glmnet-error-example/blob/master/… — Gary Weissman
I am getting the same error using glmnet_2.0-5 in a similar situation using ridge logistic regression. As the comment mentions in github.com/lmweber/glmnet-error-example/blob/master/…, after stepping through the code it is to do with mlami being larger than all lambda values. Has this bug been made clear to the glmnet developers? — rwolst

user2173836 user2173836 · Accepted Answer · 2017-01-27T13:45:44

I had the same error in glmnet_2.0-5 It has something to do with how are the lambdas automatically created in some situations. The solution is to provide own lambdas

E.g:

cv.glmnet(x = X,
          y = expanded_data$y_codings,
          family="binomial", 
          alpha=0,
          lambda=exp(seq(log(0.001), log(5), length.out=100)))

Kudos to https://github.com/lmweber/glmnet-error-example/blob/master/glmnet_error_example.R