Lasso: Cross-validation for glmnet

Question

I am using cv.glmnet() to perform cross-validation (10-fold by default):

library(Matrix)
library(tm)
library(glmnet)
library(e1071)
library(SparseM)
library(ggplot2)

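# Read the training and test sets; the files have no header row
trainingData <- read.csv("train.csv", stringsAsFactors = FALSE, sep = ",", header = FALSE)
testingData  <- read.csv("test.csv", stringsAsFactors = FALSE, sep = ",", header = FALSE)

# Build the predictor matrix (no intercept column); V42 is the class label
x <- model.matrix(as.factor(V42) ~ . - 1, data = trainingData)
crossVal <- cv.glmnet(x = x, y = trainingData$V42, family = "multinomial", alpha = 1)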
plot(crossVal)

I get the following error message:

Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  : 
  one multinomial or binomial class has 1 or 0 observations; not allowed

But as shown below, none of the class levels appears to have a count of 0 or 1.

> table(trainingData$V42)

       back buffer_overflow       ftp_write    guess_passwd            imap         ipsweep            land      loadmodule        multihop 
        956              30               8              53              11            3599              18               9               7 
    neptune            nmap          normal            perl             phf             pod       portsweep         rootkit           satan 
      41214            1493           67343               3               4             201            2931              10            3633 
      smurf             spy        teardrop     warezclient     warezmaster 
       2646               2             892             890              20

Any pointers?

Use levels(trainingData$V42) if it is a factor variable, and check which factor level has no observations in either the test or training set. — Vikram Venkat
The problem may be that one of the factor levels is not present in your test or training data. — Vikram Venkat
This means V42 is stored as a string vector. Compare tables of both your testdata$V42 and traindata$V42. — Vikram Venkat
Why compare with testData? This is 10-fold cross-validation, and it only uses my training dataset. — Desta Haileselassie Hagos

Hong Ooi · Accepted Answer · 2016-03-15T12:30:28

cv.glmnet does N-fold cross-validation, with N=10 by default. This means it splits your data into 10 subsets, then trains a model on 9 of the 10 and tests it on the remaining one. It repeats this, leaving out each subset in turn.
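To make the fold mechanics concrete, here is a minimal base-R sketch of an unstratified random fold assignment. The toy response vector is made up for illustration, and the assumption is that folds are assigned at random without regard to class. It counts how many observations of a 2-observation class (like spy) remain in each training subset.

```r
# Toy response: one common class and one class with only 2 observations,
# like "spy" in the question. This data is made up for illustration.
set.seed(1)
y <- factor(c(rep("normal", 98), rep("spy", 2)))

# Random, unstratified assignment of each observation to one of 10 folds
nfolds <- 10
foldid <- sample(rep(seq_len(nfolds), length.out = length(y)))

# For each held-out fold, count the "spy" rows left in the training subset
spy_in_training <- sapply(seq_len(nfolds), function(k) {
  sum(y[foldid != k] == "spy")
})
print(spy_in_training)

# Folds whose training subset has 0 or 1 "spy" rows would trigger the error
problem_folds <- which(spy_in_training <= 1)
print(problem_folds)
```

With only 2 observations in a class, holding out a fold that contains either one always leaves at most 1 in the training subset, so at least one fold is affected regardless of the random seed.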

Some of your classes are rare enough (spy has 2 observations, perl has 3) that a training subset can end up with only 0 or 1 observations of a class, which is the problem encountered here (and in your previous question). The best solution is to reduce the number of classes in your response by combining the rarer classes (do you really need a predicted probability for spy or perl, for example?).
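One way to combine the rarer classes, as a rough sketch: count each class, pick a cutoff, and relabel everything below it with a single catch-all level. The toy vector, the cutoff `min_count`, and the label "other" are all illustrative assumptions, not values taken from the question.

```r
# Made-up response vector standing in for trainingData$V42
y <- c(rep("normal", 50), rep("neptune", 30), rep("perl", 3), rep("spy", 2))

# Hypothetical cutoff: classes with fewer observations than this are merged
min_count <- 20

# Find the class labels that fall below the cutoff
counts <- table(y)
rare <- names(counts)[counts < min_count]

# Replace rare labels with a single catch-all label ("other" is arbitrary)
y_combined <- factor(ifelse(y %in% rare, "other", y))

print(table(y_combined))
```

The combined vector would then be passed as `y` to cv.glmnet in place of the original labels. The cutoff should be chosen so that the merged class itself is large enough to appear more than once in every training subset.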

Also, if you're doing glmnet cross-validation and constructing a model matrix, you could use the glmnetUtils package I wrote to streamline the process.
