I have a feature dataset with 5000 rows, on which I would like to do binary classification. I have 2 class vectors for it:
Y1 - the classes are pretty balanced (0 - 52%/ 1- 48%)
Y2 - the classes are very imbalanced (0 - 90%/1 - 10%)
I've split the dataset into a training set (4,000 samples) and a test set (1,000 samples).
Then, I've written simple code that takes a dataset X and a class vector Y, and creates a balanced dataset whose length is twice the number of minority-class samples.
For example, in the training dataset above, using the 90%/10% class vector, there will be 400 1s and 3,600 0s, so the code will create a new, balanced 800 sample dataset made of the original 400 samples of class 1 and 400 randomly chosen samples of class 0.
So from a 4,000 sample imbalanced training set, I get an 800 sample balanced dataset, and use it for training the learning algorithm.
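The balancing step described above can be sketched roughly as follows. This is a minimal illustration, assuming NumPy arrays and binary labels; the function name `undersample_to_balance`, the fixed seed, and the toy data are assumptions made for the example, not the original code.

```python
import numpy as np

def undersample_to_balance(X, y, seed=0):
    """Keep every minority-class sample plus an equal number of
    randomly chosen majority-class samples (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)

    # Find the minority class and how many samples it has.
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_minority = counts.min()

    minority_idx = np.flatnonzero(y == minority)
    majority_idx = np.flatnonzero(y != minority)

    # Draw as many majority samples as there are minority samples.
    kept_majority = rng.choice(majority_idx, size=n_minority, replace=False)

    # Shuffle so the two classes are not stored in contiguous blocks.
    idx = rng.permutation(np.concatenate([minority_idx, kept_majority]))
    return X[idx], y[idx]

# Toy data with the same proportions as the 90%/10% training set.
X_train = np.arange(4000 * 3).reshape(4000, 3)
y_train = np.array([1] * 400 + [0] * 3600)

X_bal, y_bal = undersample_to_balance(X_train, y_train)
print(X_bal.shape, int(y_bal.sum()))
```

Only the training set is resampled here; the test set keeps its original class proportions.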
I then evaluate the resulting model on the remaining 1,000 samples (the test set).
I ran the balancing code on both class vectors - the balanced and the imbalanced one (even though I did not need it in the balanced class vector).
When using the balanced class vector, I get this confusion matrix for the 1,000 sample test set:
[[339 126]
 [288 246]]
             precision    recall  f1-score   support
        0.0       0.54      0.73      0.62       465
        1.0       0.66      0.46      0.54       534
avg / total       0.61      0.59      0.58       999
When using the imbalanced class vector, I get this confusion matrix for the 1,000 sample test set:
[[574 274]
 [ 73  78]]
             precision    recall  f1-score   support
        0.0       0.89      0.68      0.77       848
        1.0       0.22      0.52      0.31       151
avg / total       0.79      0.65      0.70       999
As you can see, the precision of class 1 is very low.
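To make explicit where the 0.22 comes from, here is a small sketch that recomputes per-class precision and recall from the imbalanced confusion matrix. It assumes the scikit-learn layout (rows are true classes, columns are predicted classes), which matches the support counts in the report; the helper name `per_class_metrics` is made up for this example.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    metrics = {}
    for k in range(len(cm)):
        true_positives = cm[k][k]
        predicted_as_k = sum(row[k] for row in cm)  # column sum
        actually_k = sum(cm[k])                     # row sum (support)
        metrics[k] = {
            "precision": true_positives / predicted_as_k,
            "recall": true_positives / actually_k,
        }
    return metrics

imbalanced_cm = [[574, 274],
                 [73, 78]]

m = per_class_metrics(imbalanced_cm)
# Class 1: 78 true positives out of 274 + 78 = 352 predicted positives.
print(round(m[1]["precision"], 2), round(m[1]["recall"], 2))
```

The 274 class 0 samples predicted as class 1 outnumber the 78 correct class 1 predictions, which is why precision is low even though about half of the true class 1 samples are found.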
I also tried several algorithms from the imbalanced-learn package to create a balanced dataset (such as undersampling using cluster centroids, or oversampling using SMOTE SVM), but the result is always the same: the precision of class 1 (the minority class) stays very low.
Could you please advise what you would do in such a situation? My goal is to try and bring the precision of class 1, in the imbalanced class vector, to around 0.6, as it is in the balanced class vector.