8 votes

I have a dataset of some 20,000 training examples on which I want to do binary classification. The problem is that the dataset is heavily imbalanced, with only around 1,000 examples in the positive class. I am trying to use xgboost (in R) for my prediction.

I have tried oversampling and undersampling, and no matter what I do, the predictions always end up classifying everything as the majority class.

I tried reading this article on how to tune parameters in xgboost: https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

However, it only mentions which parameters help with imbalanced datasets, not how to tune them.

I would appreciate any advice on tuning the learning parameters of xgboost to handle imbalanced datasets, and also on how to generate the validation set in such cases.


3 Answers

10 votes

According to the XGBoost documentation, the scale_pos_weight parameter is the one that deals with imbalanced classes; see the documentation here:

scale_pos_weight [default=1]: Controls the balance of positive and negative weights; useful for unbalanced classes. A typical value to consider: sum(negative cases) / sum(positive cases). See Parameters Tuning for more discussion, and the Higgs Kaggle competition demo for examples: R, py1, py2, py3
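As a minimal sketch of that recommendation (the names X and y are placeholders for a numeric feature matrix and a 0/1 label vector; for the asker's data the ratio works out to roughly 19000/1000 = 19):

library(xgboost)

# Placeholders: X = numeric feature matrix, y = 0/1 label vector
spw <- sum(y == 0) / sum(y == 1)   # ~19000 / 1000 = 19 for this dataset

dtrain <- xgb.DMatrix(data = X, label = y)
bst <- xgboost(data = dtrain, nrounds = 100,
               objective = "binary:logistic",
               eval_metric = "auc",
               scale_pos_weight = spw)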

4 votes

Try something like this in R:

bstSparse <- xgboost(data = xgbTrain, nrounds = 200, nthread = 2,
                     max_depth = 4, eta = 0.2, gamma = 2.5,
                     colsample_bytree = 0.7,
                     scale_pos_weight = 48,    # ratio of negative to positive cases
                     eval_metric = "auc",      # both metrics are reported during training
                     eval_metric = "logloss",
                     objective = "binary:logistic")

where scale_pos_weight reflects the class imbalance (the ratio of negative to positive cases); my baseline incidence rate is ~4%. Use hyperparameter optimization, and you can include scale_pos_weight in the search too.
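For example, a rough grid search with xgb.cv might look like the sketch below (dtrain is assumed to be an xgb.DMatrix of the training data, and the grid values are arbitrary). Note that stratified = TRUE keeps the class ratio similar across folds, which also speaks to the question of how to build a validation set for imbalanced data:

for (spw in c(1, 5, 10, 25, 50)) {
  cv <- xgb.cv(data = dtrain, nrounds = 200, nfold = 5,
               objective = "binary:logistic",
               eval_metric = "auc",
               max_depth = 4, eta = 0.2,
               scale_pos_weight = spw,
               stratified = TRUE,               # keep the class ratio similar in each fold
               early_stopping_rounds = 20, verbose = 0)
  cat("scale_pos_weight =", spw,
      "best test AUC =", max(cv$evaluation_log$test_auc_mean), "\n")
}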

1 vote

A technique often used with neural networks is to introduce some noise into the observations; in R, the jitter function does this. For your 1,000 rare cases, apply only a small amount of jitter to their features to generate another 1,000 cases. Run your code again and see whether the predictions now pick up any of the positive class. You can experiment with adding more cases and/or varying the amount of jitter.
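A minimal sketch of the idea (assuming a data frame train with numeric feature columns and a 0/1 label column; all names are placeholders):

pos <- train[train$label == 1, ]            # the ~1,000 positive cases

# Copy them and perturb each numeric feature slightly;
# `factor` controls how much noise jitter adds
synthetic <- pos
feature_cols <- setdiff(names(pos), "label")
synthetic[feature_cols] <- lapply(pos[feature_cols], jitter, factor = 0.5)

# Append the jittered copies, doubling the positive class
train_aug <- rbind(train, synthetic)

HTH, cousin_pete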