I have a dataset with 66854 samples and 4 feature columns: "Height", "Weight", "Belly" and "Hip". Belly and Hip are encoded as 0, 1, 2 (narrow, medium and large, respectively). I am trying to predict Jean size from these features.

Here is the number of samples in each class:

df2["Jean"].value_counts()
28    11780
27    10166
26     9259
29     7260
30     6905
32     5688
25     5196
24     3932
31     3603
33     3065
Name: Jean, dtype: int64

After splitting the data 0.8 train / 0.2 test with train_test_split() from sklearn and training a LogisticRegression model with default parameters, I get this classification report:

              precision    recall  f1-score   support

          24       0.39      0.40      0.39      1966
          25       0.00      0.00      0.00      2598
          26       0.27      0.45      0.34      4630
          27       0.25      0.14      0.18      5083
          28       0.28      0.59      0.38      5890
          29       0.00      0.00      0.00      3630
          30       0.26      0.28      0.27      3453
          31       0.00      0.00      0.00      1801
          32       0.31      0.40      0.35      2844
          33       0.58      0.36      0.44      1532

    accuracy                           0.29     33427
   macro avg       0.23      0.26      0.23     33427
weighted avg       0.23      0.29      0.24     33427
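For reference, the setup described above can be sketched roughly as follows. The real `df2` is not shown, so the data here is a synthetic stand-in, and the rule linking "Jean" to "Weight" is purely hypothetical, included only so the example runs. `test_size=0.2` follows the description, although the support column in the report sums to 33427, which is half of the dataset. `random_state` and `zero_division=0` are my additions for reproducibility and to silence the undefined-metric warning.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df2 (assumption: the real data is not available here).
rng = np.random.default_rng(0)
n = 2000
df2 = pd.DataFrame({
    "Height": rng.normal(165, 7, n).round(),
    "Weight": rng.normal(62, 10, n).round(),
    "Belly": rng.integers(0, 3, n),  # 0 = narrow, 1 = medium, 2 = large
    "Hip": rng.integers(0, 3, n),
})
# Hypothetical rule tying Jean size to Weight, only to get labels in 24..33.
raw_size = 24 + (df2["Weight"] - 45) / 4 + rng.normal(0, 1, n)
df2["Jean"] = np.clip(raw_size.round(), 24, 33).astype(int)

X = df2[["Height", "Weight", "Belly", "Hip"]]
y = df2["Jean"]

# 0.8 train / 0.2 test, as described in the question.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Default parameters, as in the question (may emit a ConvergenceWarning).
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

report = classification_report(y_test, y_pred, zero_division=0)
print(report)

# Classes present in the test labels that the model never predicts.
never_predicted = np.setdiff1d(y_test, y_pred)
print("never predicted:", never_predicted)
```

The last two lines list the classes that never show up in the predictions, which is the symptom seen for 25, 29 and 31 in the report.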

As the report shows, classes 25, 29 and 31 have zero precision and zero recall, and when I use this model it never predicts those classes. What is the reason for that, and how can I fix it?

I’m voting to close this question because it is not about programming as defined in the help center but about ML theory and/or methodology - please see the intro and NOTE in stackoverflow.com/tags/machine-learning/info - desertnaut