
I'm completely new to machine learning, but I'm working on a data set where I want to perform a three-class classification and compare a few models using caret. When trying to use glmnet I run into a problem and receive the following error messages:

returning Inf
model fit failed for Fold6.Rep10: alpha=0.4198, lambda=0.523974
Error in T[i, ] : subscript out of bounds
There were missing values in resampled performance measures.

Something is wrong; all the Mean_Balanced_Accuracy metric values are missing:
    logLoss         AUC          prAUC        Accuracy       Kappa        Mean_F1    Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value Mean_Precision
 Min.   : NA   Min.   :0.5   Min.   : NA   Min.   : NA   Min.   : NA   Min.   : NA   Min.   : NA      Min.   : NA      Min.   : NA         Min.   : NA         Min.   : NA   
 1st Qu.: NA   1st Qu.:0.5   1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   1st Qu.: NA      1st Qu.: NA      1st Qu.: NA         1st Qu.: NA         1st Qu.: NA   
 Median : NA   Median :0.5   Median : NA   Median : NA   Median : NA   Median : NA   Median : NA      Median : NA      Median : NA         Median : NA         Median : NA   
 Mean   :NaN   Mean   :0.5   Mean   :NaN   Mean   :NaN   Mean   :NaN   Mean   :NaN   Mean   :NaN      Mean   :NaN      Mean   :NaN         Mean   :NaN         Mean   :NaN   
 3rd Qu.: NA   3rd Qu.:0.5   3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA      3rd Qu.: NA      3rd Qu.: NA         3rd Qu.: NA         3rd Qu.: NA   
 Max.   : NA   Max.   :0.5   Max.   : NA   Max.   : NA   Max.   : NA   Max.   : NA   Max.   : NA      Max.   : NA      Max.   : NA         Max.   : NA         Max.   : NA   
 NA's   :5                   NA's   :5     NA's   :5     NA's   :5     NA's   :5     NA's   :5        NA's   :5        NA's   :5           NA's   :5           NA's   :5     
  Mean_Recall  Mean_Detection_Rate Mean_Balanced_Accuracy
 Min.   : NA   Min.   : NA         Min.   : NA           
 1st Qu.: NA   1st Qu.: NA         1st Qu.: NA           
 Median : NA   Median : NA         Median : NA           
 Mean   :NaN   Mean   :NaN         Mean   :NaN           
 3rd Qu.: NA   3rd Qu.: NA         3rd Qu.: NA           
 Max.   : NA   Max.   : NA         Max.   : NA           
 NA's   :5     NA's   :5           NA's   :5             
Error: Stopping

Error traceback:

5.
stop("Stopping", call. = FALSE)
4.
train.default(x, y, weights = w, ...)
3.
train(x, y, weights = w, ...)
2.
train.formula(Species ~ ., data = dfiT, method = "glmnet", trControl = trCtr, metric = "Mean_Balanced_Accuracy", tuneLength = 5, family = "multinomial", type.multinomial = "grouped", standardize.response = F, maximize = T)
1.
train(Species ~ ., data = dfiT, method = "glmnet", trControl = trCtr, metric = "Mean_Balanced_Accuracy", tuneLength = 5, family = "multinomial", type.multinomial = "grouped", standardize.response = F, maximize = T)

When fitting the model directly with cv.glmnet, it runs without any issues and I get the expected output. However, I seem to be making a mistake when using caret, and I can't figure out what I'm doing wrong.
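For reference, the direct fit that works looks roughly like this (a sketch assuming iris-style data, not my exact call):

```r
library(glmnet)

# Build the predictor matrix and the three-class outcome (assumed setup)
x <- model.matrix(Species ~ . - 1, data = iris)
y <- iris$Species

# Direct cv.glmnet fit: this runs without errors
cvfit <- cv.glmnet(x, y,
                   family = "multinomial",
                   type.multinomial = "grouped")

# Class predictions at the lambda minimising the CV deviance
predict(cvfit, newx = x[1:5, ], s = "lambda.min", type = "class")
```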

I'm not actually working with the iris data frame, but I can reproduce the same error I get with my data by using the iris data. I also added a binary column, since my data contains one. The number of observations per class is not equal, but that doesn't seem to be the problem here.

I think this is probably a beginner's error, but I can't seem to find a solution (either online or in the manuals).

Does someone have a suggestion for a possible solution?

This is the code I'm using:

library(caret)

dfi <- iris

# 80/20 train/test split
i <- createDataPartition(dfi$Species, times = 1, p = .8, list = FALSE)

dfiT    <- dfi[i, ]
dfiTest <- dfi[-i, ]

pp      <- preProcess(dfiT, method = c("nzv", "YeoJohnson", "center", "scale"))
dfiT    <- predict(pp, dfiT)
dfiTest <- predict(pp, dfiTest)

# add a random binary column (my real data contains one)
dfiT$bin    <- ifelse(runif(nrow(dfiT)) > .5, 1, 0)
dfiTest$bin <- ifelse(runif(nrow(dfiTest)) > .5, 1, 0)

indFold <- createMultiFolds(dfiT,
                            k = 12,
                            times = 10)

trCtr <- trainControl(method = "repeatedcv",
                      savePredictions = "final",
                      returnResamp = "final",
                      classProbs = TRUE,
                      summaryFunction = multiClassSummary,
                      selectionFunction = best,
                      search = "random",
                      sampling = "smote",
                      index = indFold)

net.fit <- train(Species ~ ., data = dfiT,
                 method = "glmnet",
                 trControl = trCtr,
                 metric = "Mean_Balanced_Accuracy",
                 tuneLength = 5,
                 family = "multinomial",
                 type.multinomial = "grouped",
                 standardize.response = FALSE,
                 maximize = TRUE)

1 Answer


You need to pass the outcome vector, not the whole data frame, to createMultiFolds():

indFold = createMultiFolds(dfiT$Species,
                           k=5,
                           times=2)
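To see why passing the data frame breaks: createMultiFolds() treats its first argument as the outcome vector, and the length of a data frame is its number of columns, so the resampling indices end up pointing at a handful of columns instead of rows, which is what produces the "subscript out of bounds" failures:

```r
# length() of a data frame counts columns, not rows, so the "folds"
# built from the whole data frame index only a handful of items
length(iris)          # 5   (columns)
length(iris$Species)  # 150 (rows) -- what createMultiFolds() should see
```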

Running the rest of your code, some warnings still appear, but that is because the fit does not converge for some lambda values; you need to tune those:

There were 41 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: from glmnet Fortran code (error code -88); Convergence for 88th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned
2: from glmnet Fortran code (error code -76); Convergence for 76th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned
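One way to deal with those convergence warnings is to replace tuneLength with an explicit tuneGrid, so caret only tries alpha/lambda pairs in a range that converges (the values below are illustrative, not tuned for your data):

```r
# Hypothetical grid -- adjust the ranges to your data
myGrid <- expand.grid(alpha  = seq(0.1, 0.9, by = 0.2),
                      lambda = 10^seq(-3, -1, length.out = 5))

net.fit <- train(Species ~ ., data = dfiT,
                 method = "glmnet",
                 trControl = trCtr,
                 metric = "Mean_Balanced_Accuracy",
                 tuneGrid = myGrid,
                 family = "multinomial",
                 type.multinomial = "grouped")
```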

The end result:

net.fit

glmnet 

120 samples
  5 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times) 
Summary of sample sizes: 96, 96, 96, 96, 96, 96, ... 
Addtional sampling using SMOTE

Resampling results across tuning parameters:

  alpha       lambda       logLoss    AUC        prAUC      Accuracy   Kappa  
  0.09922752  0.012060178  0.1853623  0.9973958  0.8701571  0.9375000  0.90625
  0.27810227  0.013560190  0.1891306  0.9979167  0.8712070  0.9541667  0.93125
  0.36162013  0.005174164  0.1216655  0.9989583  0.8730704  0.9666667  0.95000
  0.49706016  0.008790394  0.1433498  0.9979167  0.8710893  0.9625000  0.94375
  0.61841259  0.036938590  0.2514922  0.9955729  0.8667772  0.9666667  0.95000
  Mean_F1    Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value
  0.9359686  0.9375000         0.9687500         0.9503199          
  0.9533813  0.9541667         0.9770833         0.9627609          
  0.9660299  0.9666667         0.9833333         0.9731313          
  0.9617473  0.9625000         0.9812500         0.9701684          
  0.9663368  0.9666667         0.9833333         0.9718519          
  Mean_Neg_Pred_Value  Mean_Precision  Mean_Recall  Mean_Detection_Rate
  0.9724186            0.9503199       0.9375000    0.3125000          
  0.9794863            0.9627609       0.9541667    0.3180556          
  0.9851508            0.9731313       0.9666667    0.3222222          
  0.9834079            0.9701684       0.9625000    0.3208333          
  0.9847495            0.9718519       0.9666667    0.3222222          
  Mean_Balanced_Accuracy
  0.953125              
  0.965625              
  0.975000              
  0.971875              
  0.975000