My training data is extremely class-imbalanced ({0: 872525, 1: 3335}) with 100 features. I use XGBoost to build the classification model, with Bayesian optimisation to tune the hyperparameters over the ranges {learning_rate: (0.001, 0.1), min_split_loss: (0, 10), max_depth: (3, 70), min_child_weight: (1, 20), max_delta_step: (1, 20), subsample: (0, 1), colsample_bytree: (0.5, 1), lambda: (0, 10), alpha: (0, 10), scale_pos_weight: (1, 262), n_estimators: (1, 20)}. I also use binary:logistic as the objective and roc_auc as the metric, with the gbtree booster.

The cross-validation score is 82.5%. However, when I apply the model to the test data I get only roc_auc: 75.2%, pr_auc: 15%, log_loss: 0.046, and the confusion matrix [[19300, 7], [103, 14]]. I need help finding the best way to increase the true positive rate to around 60%, tolerating false positives up to 3 times the number of actual positives.
1 Answer
You mentioned that your dataset is very imbalanced.
I'd recommend looking at imblearn, which is "a python package offering a number of re-sampling techniques commonly used in datasets showing strong between-class imbalance." These techniques include, for example, over- and under-sampling.
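As a minimal sketch of how this could be combined with your XGBoost setup (the names X and y and the 10:1 target ratio are placeholders, not a recommendation), you could undersample the majority class before fitting:

```python
from collections import Counter

from imblearn.under_sampling import RandomUnderSampler
from xgboost import XGBClassifier

# X, y are placeholders for your 100-feature matrix and 0/1 labels.
# sampling_strategy=0.1 keeps all positives and undersamples the
# majority class to a 10:1 ratio instead of the original ~262:1.
rus = RandomUnderSampler(sampling_strategy=0.1, random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y_res))  # roughly {0: 33350, 1: 3335}

# Fit XGBoost on the resampled data with your existing objective/metric.
clf = XGBClassifier(objective="binary:logistic", eval_metric="auc")
clf.fit(X_res, y_res)
```

If you cross-validate, wrap the sampler and the classifier in imblearn.pipeline.Pipeline so the resampling is applied only to the training folds and never to the validation data; an over-sampler such as SMOTE from the same package can be dropped into the same spot.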
The full documentation and examples for the library are at https://imbalanced-learn.org/stable/.
If you are working on this dataset in a company, you can also investigate getting more data or pruning your dataset using rules or heuristics.