15 votes

I'm having a lot of trouble figuring out how to correctly set num_class for xgboost.

I've got an example using the iris data:

df <- iris

# Use Species as the label and re-level it to 1, 2, 3
y <- df$Species
num.class <- length(levels(y))
levels(y) <- 1:num.class
head(y)

# Keep the four feature columns
df <- df[, 1:4]

y <- as.matrix(y)
df <- as.matrix(df)

param <- list("objective" = "multi:softprob",
              "num_class" = 3,
              "eval_metric" = "mlogloss",
              "nthread" = 8,
              "max_depth" = 16,
              "eta" = 0.3,
              "gamma" = 0,
              "subsample" = 1,
              "colsample_bytree" = 1,
              "min_child_weight" = 12)

model <- xgboost(param=param, data=df, label=y, nrounds=20)

This returns an error:

Error in xgb.iter.update(bst$handle, dtrain, i - 1, obj) : 
SoftmaxMultiClassObj: label must be in [0, num_class), num_class=3 but found 3 in label

If I change num_class to 2 I get the same error. If I increase num_class to 4, the model runs, but I get 600 predicted probabilities back, which makes sense for 4 classes (iris has 150 rows, and 150 × 4 = 600).
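For reference, a quick look at what I'm actually passing in (a sketch):

range(as.integer(y))   # 1 3
dim(df)                # 150 4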

I'm not sure if I'm making an error or whether I'm failing to understand how xgboost works. Any help would be appreciated.

num_class is the number of distinct classes for a classification problem. In your case, with the iris dataset, it should be set to 3. - Sergey Bushmanov
It is set to 3. The error I pasted above is from that setting. - House
Would you please post the output of unique(y)? - Sergey Bushmanov
No problem: unique(y) returns [,1] [1,] "1" [2,] "2" [3,] "3" - House
You should set num_class=3, as I said before, and the levels should run sequentially from 0 to 2. That will solve your problem. - Sergey Bushmanov

4 Answers

9 votes

label must be in [0, num_class). In your script, add y <- y - 1 before the model <- ... line.
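For completeness, a minimal sketch of the corrected script (same data as the question; the only change is the 0-based label):

library(xgboost)

# Encode Species as 0, 1, 2 instead of 1, 2, 3
y  <- as.integer(iris$Species) - 1
df <- as.matrix(iris[, 1:4])

param <- list("objective" = "multi:softprob",
              "num_class" = 3,
              "eval_metric" = "mlogloss")

model <- xgboost(params = param, data = df, label = y, nrounds = 20)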

4 votes

I ran into this rather weird problem as well. In my case it seemed to be a result of not properly encoding the labels.

First, using a string vector with N classes as the labels, I could only get the algorithm to run by setting num_class = N + 1. However, this result was useless, because I only had N actual classes and N+1 buckets of predicted probabilities.

I re-encoded the labels as integers and then num_class worked fine when set to N.

# Convert classes to integers for xgboost (requires data.table)
library(data.table)

class <- data.table(interest_level = c("low", "medium", "high"), class = c(0, 1, 2))
t1    <- merge(t1, class, by = "interest_level", all.x = TRUE, sort = FALSE)

and, for example:

param <- list(booster = "gbtree",
              objective = "multi:softprob",
              eval_metric = "mlogloss",
              # nthread = 13,
              num_class = 3,
              eta_decay = .99,
              eta = .005,
              gamma = 1,
              max_depth = 4,
              min_child_weight = .9,  # 1,
              subsample = .7,
              colsample_bytree = .5)

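A usage sketch under the same assumptions (t1 is a data.table holding the predictors plus the merged class column; the feature selection and nrounds value here are hypothetical):

# Drop the label columns; everything else is treated as a feature.
features <- setdiff(names(t1), c("interest_level", "class"))
dtrain   <- xgb.DMatrix(data = as.matrix(t1[, features, with = FALSE]),
                        label = t1$class)
model    <- xgb.train(params = param, data = dtrain, nrounds = 100)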

4 votes

I was seeing the same error; my issue was that I was using an eval_metric that is only meant for multiclass labels when my data had binary labels. See eval_metric in the Learning Task Parameters section of the XGBoost docs for a list of all the options.
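For instance, with 0/1 labels, a binary objective/metric pair avoids the mismatch; a minimal sketch (df and y01 are placeholders for your feature matrix and 0/1 label vector):

# Binary problem: no num_class, and a binary metric instead of mlogloss
param <- list(objective   = "binary:logistic",
              eval_metric = "logloss")

model <- xgboost(params = param, data = df, label = y01, nrounds = 20)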

0 votes

I had this problem and it turned out that I was trying to subtract 1 from my response, which was already coded as 0 and 1. Probably a novice mistake, but in case anyone else runs into this with a binary response variable that is already 0 and 1, it is something to make note of.

The tutorial said:

label = as.integer(iris$Species)-1

What worked for me (response is high_end):

label = as.integer(high_end)
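A quick sanity check before training (a sketch; label is whatever vector gets passed to xgboost):

# Labels must already lie in [0, num_class);
# for a binary problem that means exactly 0 and 1.
stopifnot(all(label %in% c(0, 1)))
table(label)   # class counts; nothing outside 0/1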