
I am trying to fit a Lasso regression with a cross-validated lambda using the glmnet and caret packages. My code is:

dim(x)
# 121755    465
dim(y)
# 121755      1

### cv.glmnet
library(glmnet)
library(parallel)      # makePSOCKcluster, detectCores
library(doParallel)    # registerDoParallel

set.seed(2108)
cl <- makePSOCKcluster(detectCores() - 2, outfile = "")
registerDoParallel(cl)
system.time(
  las.glm <- cv.glmnet(x = x, y = y, alpha = 1, type.measure = "mse",
                       parallel = TRUE, nfolds = 5,
                       lambda = seq(0.001, 0.1, by = 0.001),
                       standardize = FALSE)
)
stopCluster(cl)

# user  system elapsed 
# 17.98 2.28   37.23 
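For what it's worth, the cross-validated lambda can then be read off the cv.glmnet fit like this (lambda.min and coef are standard glmnet accessors):

# inspect the CV-selected lambda and the coefficients at that lambda
las.glm$lambda.min                  # lambda minimizing CV MSE
coef(las.glm, s = "lambda.min")     # sparse coefficient vector at that lambda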

### caret
library(caret)

caretctrl <- trainControl(method = "cv", number = 5)
tune <- expand.grid(alpha = 1, lambda = seq(0.001, 0.1, by = 0.001))

set.seed(2108)
cl <- makePSOCKcluster(detectCores() - 2, outfile = "")
registerDoParallel(cl)
system.time(
  las.car <- train(x = x, y = as.numeric(y), alpha = 1, method = "glmnet",
                   metric = "RMSE", allowParallel = TRUE,
                   trControl = caretctrl, tuneGrid = tune)
)
stopCluster(cl)

# error
Something is wrong; all the RMSE metric values are missing:
  RMSE        Rsquared        MAE     
Min.   : NA   Min.   : NA   Min.   : NA  
1st Qu.: NA   1st Qu.: NA   1st Qu.: NA  
Median : NA   Median : NA   Median : NA  
Mean   :NaN   Mean   :NaN   Mean   :NaN  
3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA  
Max.   : NA   Max.   : NA   Max.   : NA  
NA's   :100   NA's   :100   NA's   :100  
Error: Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.
Timing stopped at: 3.97 1.37 127.9

I understand that this might be due to not having enough data in one of the resamples, but I doubt that is an issue given my data size and just 5 folds. I have tried a few fixes suggested in similar questions, but none of them worked for me.

I suspect that caret is performing some additional resampling that cv.glmnet is not, and that this leads to the error. Can someone shed some light on this problem?

Edit 1

x is a semi-sparse matrix of 210 indicator and 255 continuous variables.
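Roughly, a stand-in with the same structure (hypothetical data, not my actual x and y) would look like:

# hypothetical stand-in: 210 indicator + 255 continuous columns
# (smaller n than my 121755 rows, just to sketch the structure)
set.seed(1)
n <- 1000
x_ind  <- matrix(rbinom(n * 210, 1, 0.05), nrow = n)   # mostly-zero 0/1 indicators
x_cont <- matrix(rnorm(n * 255), nrow = n)             # continuous predictors
x <- cbind(x_ind, x_cont)
colnames(x) <- paste0("V", seq_len(ncol(x)))
y <- matrix(x %*% rnorm(ncol(x), sd = 0.1) + rnorm(n), ncol = 1)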

Comment: I have also tried setting allowParallel in trainControl, without any luck. – FightMilk
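(For completeness, that variant looks like this; allowParallel is a real trainControl argument and defaults to TRUE:)

caretctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)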

1 Answer


I think most of the problem comes from passing alpha = 1 again as a direct argument to train, on top of the alpha already in your tuning grid. Here is a small example; even if your x and y are sparse, it will work:

library(glmnet)
library(caret)
library(Matrix)

# sparse version of mtcars, to mimic a semi-sparse design matrix
dat <- Matrix(as.matrix(mtcars), sparse = TRUE)
x <- dat[, -1]                 # sparse predictor matrix
y <- as.numeric(mtcars$mpg)    # response

L <- seq(0.001, 0.1, by = 0.02)

las.glm <- cv.glmnet(x = x, y = y, alpha = 1, type.measure = "mse",
                     nfolds = 5, lambda = L, standardize = FALSE)

So cv.glmnet works. Now, if we run your caret call, it reproduces the error:

caretctrl <- trainControl(method = "cv", number = 5)
tune <- expand.grid(alpha = 1, lambda = L)

las.car <- train(x = x, y = as.numeric(y), alpha = 1, method = "glmnet",
                 metric = "RMSE", trControl = caretctrl, tuneGrid = tune)

Something is wrong; all the RMSE metric values are missing:
      RMSE        Rsquared        MAE     
 Min.   : NA   Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA   1st Qu.: NA  

Remove the extra alpha argument (it is already supplied via tuneGrid):

las.car <- train(x = x, y = as.numeric(y), method = "glmnet",
                 metric = "RMSE", trControl = caretctrl, tuneGrid = tune)

glmnet 

32 samples
10 predictors

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 25, 26, 26, 26, 25 
Resampling results across tuning parameters:

  lambda  RMSE      Rsquared   MAE     
  0.001   3.798431  0.7689346  3.003005
  0.021   3.360426  0.7821630  2.714694
  0.041   3.099981  0.7958414  2.543577
  0.061   2.842374  0.8066351  2.328833
  0.081   2.801421  0.8046289  2.301098

And it also works with a dense matrix.
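As a quick check (a sketch; las.car2 is just an illustrative name, not from the original post), here is the same fit on the dense version, plus reading off the tuned lambda:

# dense variant of the same fit
x_dense <- as.matrix(mtcars[, -1])
las.car2 <- train(x = x_dense, y = mtcars$mpg, method = "glmnet",
                  metric = "RMSE", trControl = caretctrl, tuneGrid = tune)

las.car2$bestTune      # alpha/lambda chosen by caret
las.glm$lambda.min     # lambda chosen by cv.glmnet (fold splits differ, so it may not match exactly)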