3
votes

I have a CNN for regression that takes an image and outputs a float between 0 and 10. The model does okay, but my data is seriously imbalanced: the model predicts between 6 and 8 for almost all images and still achieves a decent mean squared error. I know that people weight their classes according to how imbalanced the dataset is. Is there a way to do this with a regression model? If it helps, my output is a float, but all of my labels fall on 0.5 intervals in the 0-10 range, so there are in a way 20 different classes. Here is the distribution of my data labels.

[Image: distribution of the data labels]

I understand there are other methods such as:

  • Oversampling the minority group.
  • Using data augmentation to make "copies" of the minority group.
  • Optimizing a different performance metric. (No idea what that would be)

Any suggestions? Thanks.


2 Answers

3
votes

Your data might originally have represented a regression problem, but after binning it into 20 groups you are effectively training your model on a 20-class classification problem. You should therefore treat it as one and look for ways to combat the imbalance. The most prevalent ways are:

  • oversampling the minority class(es)
  • undersampling the majority class(es)
  • using class weights

I usually prefer the first, because models tend to do better with more data, but the third is simpler to implement and doesn't add extra computational cost to training.
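To illustrate the oversampling option, here is a minimal sketch that draws training indices with probability inversely proportional to how often each label occurs. The function name `balanced_resample` and the toy labels are made up for illustration; with images you would apply the drawn indices to your own dataset, ideally combined with augmentation so that repeated minority images are not identical copies.

```python
import random
from collections import Counter

def balanced_resample(labels, n_draws, seed=0):
    """Draw sample indices so that every label value is picked about equally often."""
    counts = Counter(labels)
    # Each sample's draw weight is the inverse of its label's frequency,
    # so every distinct label carries the same total weight.
    weights = [1.0 / counts[y] for y in labels]
    rng = random.Random(seed)
    # Sampling is with replacement, so rare samples get repeated.
    return rng.choices(range(len(labels)), weights=weights, k=n_draws)

# Toy labels (assumed for the example): 7.0 dominates, 2.0 and 9.5 are rare.
labels = [7.0] * 90 + [2.0] * 5 + [9.5] * 5
indices = balanced_resample(labels, n_draws=3000)
resampled = Counter(labels[i] for i in indices)
print(resampled)  # each of the three labels appears roughly 1000 times
```

Undersampling would be the mirror image: keep all minority samples and draw only a subset of the majority ones, at the cost of discarding data.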

2
votes

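One popular over-sampling technique is SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic minority samples by interpolating between existing ones.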

As for optimizing a different metric, one option is a weighted loss in which each sample's weight is inversely proportional to the frequency of its class.
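A weighted loss of this kind could look roughly like the sketch below, written in plain Python so that the arithmetic is visible. The helper names `inverse_frequency_weights` and `weighted_mse` and the toy data are assumptions for illustration; in a deep learning framework the same per-sample weights would typically be multiplied into the squared error inside a custom loss or passed as sample weights.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Map each distinct label to n / (k * count), so the mean sample weight is 1."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {y: n / (k * c) for y, c in counts.items()}

def weighted_mse(y_true, y_pred, weight_of):
    """Squared errors weighted by the weight of each sample's true label."""
    total = sum(weight_of[t] * (t - p) ** 2 for t, p in zip(y_true, y_pred))
    norm = sum(weight_of[t] for t in y_true)
    return total / norm

# Toy data (assumed): eight samples labelled 7.0 and two labelled 2.0.
y_true = [7.0] * 8 + [2.0] * 2
weight_of = inverse_frequency_weights(y_true)

# A model that always predicts the majority value.
always_seven = [7.0] * len(y_true)
plain_mse = sum((t - p) ** 2 for t, p in zip(y_true, always_seven)) / len(y_true)
balanced_mse = weighted_mse(y_true, always_seven, weight_of)
print(plain_mse, balanced_mse)
```

On this toy data the plain MSE of the constant prediction is 5.0, while the weighted version is 12.5, so always predicting the majority value is penalized more heavily.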