I have a feature dataset with 5,000 rows, on which I would like to do binary classification. I have two class vectors for it:

Y1 - the classes are fairly balanced (0: 52% / 1: 48%)

Y2 - the classes are very imbalanced (0: 90% / 1: 10%)

I've split the dataset into a training set (4,000 samples) and a test set (1,000 samples).

Then I wrote simple code that takes a dataset X and a class vector Y and creates a balanced dataset whose length is 2 × the number of minority-class samples.

For example, in the training dataset above, using the 90%/10% class vector, there will be 400 1s and 3,600 0s, so the code creates a new, balanced 800-sample dataset made up of the original 400 samples of class 1 and 400 randomly chosen samples of class 0.

So from a 4,000-sample imbalanced training set, I get an 800-sample balanced dataset, which I use to train the learning algorithm.
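
The balancing step described above could be sketched roughly as follows. This is not the original code: the function name `balanced_indices` and its `seed` parameter are hypothetical, and the toy label list only mirrors the 90%/10% training set from the example. It uses only the Python standard library and returns indices, so it does not assume any particular container for X.

```python
import random


def balanced_indices(y, seed=0):
    """Return indices of a balanced subsample of a binary label sequence.

    Keeps every minority-class sample and draws an equal number of
    majority-class samples at random, without replacement.
    """
    idx_by_class = {}
    for i, label in enumerate(y):
        idx_by_class.setdefault(label, []).append(i)
    if len(idx_by_class) != 2:
        raise ValueError("expected exactly two classes")
    # The shorter index list belongs to the minority class.
    minority, majority = sorted(idx_by_class.values(), key=len)
    rng = random.Random(seed)
    sampled_majority = rng.sample(majority, len(minority))
    return sorted(minority + sampled_majority)


# Toy labels mirroring the 4,000-sample training set: 3,600 zeros, 400 ones.
y_train = [0] * 3600 + [1] * 400
idx = balanced_indices(y_train)
print(len(idx))                      # 800
print(sum(y_train[i] for i in idx))  # 400
```

With NumPy arrays, the balanced subset would then be `X[idx]` and `Y[idx]`.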

I then evaluate the resulting model on the remaining 1,000 samples (the test set).

I ran the balancing code on both class vectors, the balanced and the imbalanced one (even though it was not needed for the balanced class vector).

When using the balanced class vector, I get this confusion matrix for the 1,000-sample test set:

[[339 126]
 [288 246]]

             precision    recall  f1-score   support

        0.0       0.54      0.73      0.62       465
        1.0       0.66      0.46      0.54       534

avg / total       0.61      0.59      0.58       999

When using the imbalanced class vector, I get this confusion matrix for the 1,000-sample test set:

[[574 274]
 [ 73  78]]

             precision    recall  f1-score   support

        0.0       0.89      0.68      0.77       848
        1.0       0.22      0.52      0.31       151

avg / total       0.79      0.65      0.70       999

As you can see, the precision of class 1 is very low.
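
To make the class 1 numbers concrete, here is a small worked calculation from the second confusion matrix. It assumes rows are true classes and columns are predicted classes (the scikit-learn convention), which is consistent with the support column (574 + 274 = 848, 73 + 78 = 151). The helper `precision_at_prevalence` is a hypothetical name, and it rests on the further assumption that the true positive and false positive rates stay fixed when the class ratio of the evaluated data changes.

```python
# Confusion matrix from the imbalanced run, assuming rows are true classes
# and columns are predicted classes.
tn, fp = 574, 274
fn, tp = 73, 78

precision = tp / (tp + fp)  # share of predicted 1s that are truly 1
recall = tp / (tp + fn)     # true positive rate
fpr = fp / (fp + tn)        # false positive rate


def precision_at_prevalence(tpr, fpr, p):
    """Precision the same classifier would show if a fraction p of the
    evaluated samples were positive, with tpr and fpr held fixed."""
    return tpr * p / (tpr * p + fpr * (1 - p))


print(round(precision, 2))                                  # 0.22
print(round(recall, 2))                                     # 0.52
print(round(precision_at_prevalence(recall, fpr, 0.5), 2))  # 0.62
```

Under these assumptions, the same error rates would give a precision of about 0.62 on a 50/50 test set, which suggests that much of the gap to the balanced case may come from the class ratio of the test set rather than from the resampling method.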

I also tried several algorithms from the imbalanced-learn package to create a balanced dataset (such as under-sampling using cluster centroids, or over-sampling using SMOTE SVM), but the result is always the same: the precision of class 1 (the minority class) stays very low.

Could you please advise what you would do in such a situation? My goal is to bring the precision of class 1 with the imbalanced class vector to around 0.6, as it is with the balanced class vector.

I've encountered a similar situation. Have you figured out a solution? - Charlotte

1 Answer

In your place, I would put proportionally greater weight on the under-represented class. XGBoost provides a rich set of parameters that you can tune to build a good model. This article discusses them in detail for Python. Look specifically at the scale_pos_weight parameter.
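
As a rough sketch of what that could look like: the XGBoost documentation suggests the ratio of negative to positive samples as a typical value to consider for scale_pos_weight. The helper names `scale_pos_weight_for` and `make_weighted_model` below are hypothetical, and the label list just mirrors the 3,600/400 training split from the question. The idea would be to train on the full 4,000-sample training set instead of the undersampled one.

```python
def scale_pos_weight_for(y):
    """Ratio of negative to positive samples, a common starting value
    for XGBoost's scale_pos_weight."""
    n_pos = sum(1 for label in y if label == 1)
    n_neg = len(y) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples")
    return n_neg / n_pos


def make_weighted_model(y_train):
    # Imported lazily so the weight calculation above runs without xgboost.
    from xgboost import XGBClassifier
    return XGBClassifier(scale_pos_weight=scale_pos_weight_for(y_train))


# Labels mirroring the question's training set: 3,600 zeros, 400 ones.
y_train = [0] * 3600 + [1] * 400
print(scale_pos_weight_for(y_train))  # 9.0
```

The model would then be fitted as usual, for example `make_weighted_model(y_train).fit(X_train, y_train)`, and the ratio treated as a starting point to tune, not a fixed answer.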

On top of that, I would also consider adding a validation set to assess the model's accuracy.
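
One way to carve out such a validation set is a stratified split, so that the validation data keeps the original class ratio. The sketch below uses only the standard library; `stratified_split`, `val_fraction` and `seed` are hypothetical names, and the 20% validation share is an arbitrary choice for illustration.

```python
import random


def stratified_split(y, val_fraction=0.2, seed=0):
    """Split indices into (train, validation), preserving class ratios."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Labels mirroring the question's training set: 3,600 zeros, 400 ones.
y = [0] * 3600 + [1] * 400
train_idx, val_idx = stratified_split(y)
print(len(train_idx), len(val_idx))  # 3200 800
print(sum(y[i] for i in val_idx))    # 80
```

Any resampling or class weighting would then be applied to the training part only, with the validation part left at its natural 90%/10% ratio so that the precision measured on it is comparable to the test set.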