I am trying to fit a Lasso regression with a cross-validated lambda using the glmnet and caret packages. My code is:
dim(x)
# 121755 465
dim(y)
# 121755 1
### cv.glmnet
set.seed(2108)
cl <- makePSOCKcluster(detectCores()-2,outfile="")
registerDoParallel(cl)
system.time(
  las.glm <- cv.glmnet(x = x, y = y, alpha = 1, type.measure = "mse",
                       parallel = TRUE, nfolds = 5,
                       lambda = seq(0.001, 0.1, by = 0.001),
                       standardize = FALSE)
)
stopCluster(cl)
# user system elapsed
# 17.98 2.28 37.23
### caret
caretctrl <- trainControl(method = "cv", number = 5)
tune <- expand.grid(alpha=1,lambda = seq(0.001,0.1,by = 0.001))
set.seed(2108)
cl <- makePSOCKcluster(detectCores()-2,outfile="")
registerDoParallel(cl)
system.time(
  las.car <- train(x = x, y = as.numeric(y), alpha = 1, method = "glmnet",
                   metric = "RMSE", allowParallel = TRUE,
                   trControl = caretctrl, tuneGrid = tune)
)
stopCluster(cl)
# error
Something is wrong; all the RMSE metric values are missing:
      RMSE        Rsquared        MAE
 Min.   : NA   Min.   : NA   Min.   : NA
 1st Qu.: NA   1st Qu.: NA   1st Qu.: NA
 Median : NA   Median : NA   Median : NA
 Mean   :NaN   Mean   :NaN   Mean   :NaN
 3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA
 Max.   : NA   Max.   : NA   Max.   : NA
 NA's   :100   NA's   :100   NA's   :100
Error: Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
Timing stopped at: 3.97 1.37 127.9
I understand that this might be due to not having enough data in one of the resamples, but I doubt that should be an issue with my data size and just 5 folds. I have tried the following solutions that didn't work for me:
- Insert vectors and not a formula
- Use allowParallel when CPU is not multithreaded
- There are no missing data
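For reference, the two data-related explanations above (missing values, too little data in a resample) can be checked directly. This is only a minimal sketch on simulated stand-in data: `x_demo` and `y_demo` are hypothetical placeholders for the real x and y, and the fold assignment is a plain random split in base R, which only approximates how caret builds its folds.

```r
# Hypothetical stand-in data; substitute the real x and y here.
set.seed(2108)
n <- 1000
x_demo <- cbind(matrix(rbinom(n * 3, 1, 0.1), n, 3),  # indicator columns
                matrix(rnorm(n * 2), n, 2))           # continuous columns
y_demo <- matrix(rnorm(n), n, 1)

# 1. Count NA / NaN / Inf entries in the predictors and the response.
n_bad_x <- sum(!is.finite(x_demo))
n_bad_y <- sum(!is.finite(y_demo))

# 2. Rows per fold for a simple random 5-fold split.
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))
fold_sizes <- table(fold_id)

# 3. Per fold, number of columns that are constant in the training rows
#    (a sparse indicator can lose all of its 1s when a fold is held out).
const_cols <- sapply(seq_len(k), function(f) {
  train_rows <- fold_id != f
  sum(apply(x_demo[train_rows, , drop = FALSE], 2,
            function(col) length(unique(col)) == 1))
})

print(c(bad_x = n_bad_x, bad_y = n_bad_y))
print(fold_sizes)
print(const_cols)
```

If all three checks come back clean on the real data, the NA metrics are unlikely to be caused by the data or the fold sizes.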
I reckon that caret is performing some other resampling which glmnet is not performing, leading to the error. Can someone shed any light on this problem?
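One way to test this resampling hypothesis would be to force both functions onto identical folds, so that any remaining difference cannot come from how the data are split. The sketch below builds the folds once in base R and expresses them in both formats: cv.glmnet accepts a `foldid` vector, and trainControl accepts an `index` list of training-row indices. The helper name `make_shared_folds` and the demo size of 1000 rows are assumptions for illustration only. The commented calls show where the folds would go and otherwise reuse the objects from my code above.

```r
# Hypothetical helper: one fold assignment, expressed in both formats.
make_shared_folds <- function(n, k, seed = 2108) {
  set.seed(seed)
  # Fold label (1..k) for every row, as cv.glmnet's foldid expects.
  foldid <- sample(rep(seq_len(k), length.out = n))
  # caret expects, per resample, the row indices used for TRAINING.
  index <- lapply(seq_len(k), function(f) which(foldid != f))
  names(index) <- paste0("Fold", seq_len(k))
  list(foldid = foldid, index = index)
}

folds <- make_shared_folds(n = 1000, k = 5)  # use nrow(x) for the real data

# With glmnet and caret loaded, the folds would be passed as:
# las.glm   <- cv.glmnet(x = x, y = y, alpha = 1, type.measure = "mse",
#                        foldid = folds$foldid,
#                        lambda = seq(0.001, 0.1, by = 0.001))
# caretctrl <- trainControl(method = "cv", number = 5, index = folds$index)

print(table(folds$foldid))
print(lengths(folds$index))
```

If caret still returns only NA metrics with the same folds that cv.glmnet handles without trouble, the resampling itself is probably not the cause.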
Edit 1
x is a semi-sparse matrix of 210 indicator and 255 continuous variables.
allowParallel in the trainControl without any luck. – FightMilk