I'm using xgboost to build a model. The dataset has only 200 rows and 10,000 columns.

I tried chi-squared feature selection to get 100 columns, but my confusion matrix looks like this:

           1    0
    1    190    0
    0     10    0

I tried using all 10,000 attributes, randomly selecting 100 attributes, and selecting 100 attributes according to chi-squared, but I never get the 0 class predicted. Is it because of the dataset, or because of the way I use xgboost?

My factor(pred.cv) always shows only one level, while factor(y+1) has two levels, 1 and 2.

param <- list("objective" = "binary:logistic",
              "eval_metric" = "error",
              "nthread" = 2,
              "max_depth" = 5,
              "eta" = 0.3,
              "gamma" = 0,
              "subsample" = 0.8,
              "colsample_bytree" = 0.8,
              "min_child_weight" = 1,
              "max_delta_step" = 5,
              "learning_rate" = 0.1,
              "n_estimators" = 1000,
              "seed" = 27,
              "scale_pos_weight" = 1
              )
nfold = 3
nrounds = 200
# reshaping into a single column means max.col can only ever return 1
pred.cv = matrix(bst.cv$pred, nrow=length(bst.cv$pred)/1, ncol=1)
pred.cv = max.col(pred.cv, "last")
factor(y+1)      # this is the target in train, levels 1 and 2
factor(pred.cv)  # this is the issue: it always has only 1 level
What is the proportion of the 1/2 levels in factor(y+1)? If imbalanced, perhaps try changing scale_pos_weight. – missuse
@missuse Only 10%; I will try that first! – user7700501
@missuse I think I have a clue. Online it says to do this: pred.cv = matrix(bst.cv$pred, nrow=length(bst.cv$pred)/num.class, ncol=num.class), but when I do this with num.class = 2, confusionMatrix(factor(y+1), factor(pred.cv)) returns the error "all arguments must have the same length", because factor(y+1) has length 980 while factor(pred.cv) has length 980/2. Do you know how to fix it? – user7700501
Try pred.cv = ifelse(bst.cv$pred < 0.5, 0, 1) and table(pred.cv, y). – missuse
@missuse I actually figured that out 5 minutes ago; I hadn't converted the probabilities to yes/no :/ Thanks! – user7700501
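
For reference, a minimal sketch of the fix from the comments, assuming bst.cv came from xgb.cv(..., prediction = TRUE) and y is the 0/1 training label:

pred.cv <- ifelse(bst.cv$pred < 0.5, 0, 1)  # bst.cv$pred is one P(y = 1) per training row
table(pred.cv, y)                           # confusion matrix: predicted vs. actual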

1 Answer


I found caret to be slow, and it cannot tune all the parameters of xgboost models without building a custom model, which is considerably more complicated than writing one's own evaluation function.

However, if you are doing up/down-sampling or SMOTE/ROSE, caret is the way to go, since it incorporates them correctly into the model evaluation phase (during resampling). See: https://topepo.github.io/caret/subsampling-for-class-imbalances.html
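
For illustration, a minimal sketch of resample-aware subsampling with caret (the data frame df and the two-level factor outcome y are assumptions here; sampling = "smote" additionally requires the DMwR package):

library(caret)

# caret re-applies SMOTE inside each resampling fold rather than once
# up front, so the resampled performance estimate is not optimistically biased
ctrl <- trainControl(method = "cv", number = 5,
                     sampling = "smote",    # or "up", "down", "rose"
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

fit <- train(y ~ ., data = df, method = "xgbTree",
             metric = "ROC", trControl = ctrl)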

However, I found these techniques to have a very small impact on the results, and usually for the worse, at least in the models I trained.

scale_pos_weight gives a higher weight to a certain class; if the minority class is at 10% abundance, then setting scale_pos_weight around 5 - 10 should be beneficial.
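
A common heuristic is the ratio of negative to positive cases (the grid-search function below uses the same idea); a minimal sketch assuming y is the 0/1 training label:

# with a 10% minority class this gives roughly 9, inside the 5 - 10 range
param$scale_pos_weight <- sum(y == 0) / sum(y == 1)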

Tuning regularization parameters can be quite beneficial for xgboost. There are several: alpha, lambda and gamma; I found values in the range 0 - 3 to work well. Other useful parameters that add stochastic regularization (by injecting randomness) are subsample, colsample_bytree and colsample_bylevel. I found that playing with colsample_bylevel can also have a positive effect on the model; subsample and colsample_bytree you are already using. A sketch of these settings follows below.
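
As a rough starting point (the values here are assumptions within the 0 - 3 range mentioned above), these parameters could be added to the param list from the question:

# alpha/lambda penalize leaf weights (L1/L2); gamma penalizes splits
param <- modifyList(param, list(
  alpha = 1,                # L1 regularization on leaf weights
  lambda = 1,               # L2 regularization on leaf weights
  gamma = 1,                # minimum loss reduction required to make a split
  colsample_bylevel = 0.8   # column subsampling at each tree depth level
))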

I would test a much smaller eta with more trees to see whether the model benefits; early_stopping_rounds can speed up the process in that case.

Other eval_metric choices will probably be more informative than error. Try logloss or auc, and even map and ndcg; a sketch combining these ideas follows below.
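
A minimal xgb.cv sketch combining the last two suggestions, with dtrain assumed to be an xgb.DMatrix of the training data:

library(xgboost)

cv <- xgb.cv(params = list(objective = "binary:logistic",
                           eval_metric = "auc",   # or "logloss"
                           eta = 0.01,            # much smaller step size
                           max_depth = 5),
             data = dtrain,
             nrounds = 5000,                # more trees to offset the low eta
             nfold = 5,
             early_stopping_rounds = 200,   # stop once test auc stalls
             verbose = 0)
cv$best_iteration                           # number of rounds actually needed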

Here is a function for a grid search over hyper-parameters. It uses auc as the evaluation metric, but one can change that easily.

xgb.par.opt = function(train, seed) {
  require(xgboost)
  ntrees = 2000
  # all parameter combinations to evaluate
  searchGridSubCol <- expand.grid(subsample = c(0.5, 0.75, 1),
                                  colsample_bytree = c(0.6, 0.8, 1),
                                  gamma = c(0, 1, 2),
                                  eta = c(0.01, 0.03),
                                  max_depth = c(4, 6, 8, 10))
  aucErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){

    # extract the parameters to test
    currentSubsampleRate <- parameterList[["subsample"]]
    currentColsampleRate <- parameterList[["colsample_bytree"]]
    currentGamma <- parameterList[["gamma"]]
    currentEta <- parameterList[["eta"]]
    currentMaxDepth <- parameterList[["max_depth"]]
    set.seed(seed)

    xgboostModelCV <- xgb.cv(data = train,
                             nrounds = ntrees,
                             nfold = 5,
                             objective = "binary:logistic",
                             eval_metric = "auc",
                             verbose = 1,
                             print_every_n = 50,
                             early_stopping_rounds = 200,
                             stratified = TRUE,
                             # weight positives by the negative/positive ratio
                             # (train is assumed to be an xgb.DMatrix)
                             scale_pos_weight = sum(getinfo(train, "label") == 0) /
                                                sum(getinfo(train, "label") == 1),
                             max_depth = currentMaxDepth,
                             eta = currentEta,
                             gamma = currentGamma,
                             colsample_bytree = currentColsampleRate,
                             min_child_weight = 1,
                             subsample = currentSubsampleRate,
                             seed = seed)

    xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)

    # keep iter plus the test auc mean/std at the early-stopped best iteration
    auc = xvalidationScores[xvalidationScores$iter == xgboostModelCV$best_iteration, c(1, 4, 5)]
    auc = cbind(auc, currentSubsampleRate, currentColsampleRate, currentGamma, currentEta, currentMaxDepth)
    names(auc) = c("iter", "test.auc.mean", "test.auc.std", "subsample", "colsample", "gamma", "eta", "max.depth")
    print(auc)
    return(auc)
  })
  return(aucErrorsHyperparameters)
}

One can add other parameters to the expand.grid call.
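
Hypothetical usage, assuming X and y hold the training features and 0/1 labels (each combination's best-iteration auc is printed as it completes):

library(xgboost)

dtrain <- xgb.DMatrix(data = as.matrix(X), label = y)
results <- xgb.par.opt(dtrain, seed = 27)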

I usually tune hyper-parameters on one CV repetition and evaluate them on additional repetitions with other seeds, or on the validation set (though using the validation set for this should be done with caution to avoid over-fitting).