TL;DR: the "curse" of class imbalance is kind of a myth, relevant only for certain types of problems.
- Not all ML models fail in an imbalanced-class setting. Most probabilistic models are not seriously affected by class imbalance. Problems usually arise when we switch to non-probabilistic or multiclass prediction.
In logistic regression (and its generalization, neural networks), class imbalance strongly affects the intercept but has very little influence on the slope coefficients. Intuitively, the predicted log-odds log(p(y=1|x)/p(y=0|x)) = a + x*b of binary logistic regression changes by a fixed amount when we change the prior probabilities of the classes, and this shift is absorbed almost entirely by the intercept a.
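Here is a minimal sketch of that effect on synthetic data (everything below is just an illustration, using scikit-learn with regularization effectively turned off): down-sampling the majority class leaves the slopes almost unchanged and shifts the intercept by roughly the log of the resampling ratio.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic imbalanced data: positives are roughly 10% of the sample.
n = 20_000
X = rng.normal(size=(n, 3))
true_beta = np.array([1.0, -2.0, 0.5])
p = 1.0 / (1.0 + np.exp(-(X @ true_beta - 3.0)))   # large negative intercept -> rare positives
y = (rng.random(n) < p).astype(int)

# Fit on the original data and on a 1:1 down-sampled version
# (C is large, so regularization does not blur the comparison).
full = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)

pos = np.where(y == 1)[0]
neg = rng.choice(np.where(y == 0)[0], size=len(pos), replace=False)
idx = np.concatenate([pos, neg])
balanced = LogisticRegression(C=1e6, max_iter=1000).fit(X[idx], y[idx])

print("slopes, full data     :", full.coef_.round(2))
print("slopes, balanced data :", balanced.coef_.round(2))        # nearly identical
print("intercept shift       :", round(balanced.intercept_[0] - full.intercept_[0], 2))
print("log(n_neg / n_pos)    :", round(np.log((y == 0).sum() / (y == 1).sum()), 2))
```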
In decision trees (and their generalizations, random forests and gradient boosted trees), class imbalance affects the leaf impurity metrics, but this effect is roughly the same for all candidate splits, so it usually does not change which splits are chosen.
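A quick sanity check on synthetic data (an illustrative sketch, not a proof): fit a shallow tree on the imbalanced sample and on a 1:1 down-sampled one, and compare where the splits go.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Synthetic imbalanced data where only feature 0 carries signal.
n = 20_000
X = rng.normal(size=(n, 4))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - 3.0)))).astype(int)

# 1:1 down-sample of the majority class.
pos = np.where(y == 1)[0]
neg = rng.choice(np.where(y == 0)[0], size=len(pos), replace=False)
idx = np.concatenate([pos, neg])

tree_full = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_bal = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[idx], y[idx])

# Both trees put essentially all of their importance on feature 0: the impurity
# penalty from the imbalance is shared by every candidate split, so the choice
# of splits barely changes (individual thresholds may move a little).
print("importances, full data    :", tree_full.feature_importances_.round(2))
print("importances, balanced data:", tree_bal.feature_importances_.round(2))
```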
On the other hand, non-probabilistic models such as SVM can be seriously affected by class imbalance. An SVM learns its separating hyperplane in such a way that roughly the same number of positive and negative examples (the support vectors) lie on the margin or on its wrong side. Therefore, resampling can dramatically change these counts and hence the position of the boundary.
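A rough illustration on synthetic 1-D data: with a 20:1 class ratio, a plain linear SVM pushes the boundary toward the rare class, and reweighting it with scikit-learn's class_weight="balanced" (which plays the same role as resampling) moves it back. This is only a sketch under these synthetic assumptions, not a general statement about where the boundary ends up.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)

# Two overlapping 1-D Gaussian classes, 20:1 imbalance; the midpoint between
# the class means is x = 1.0.
n_neg, n_pos = 20_000, 1_000
X = np.concatenate([rng.normal(0.0, 1.0, n_neg),
                    rng.normal(2.0, 1.0, n_pos)]).reshape(-1, 1)
y = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])

def boundary(clf):
    # The point where the decision function crosses zero.
    return -clf.intercept_[0] / clf.coef_[0, 0]

plain = LinearSVC(dual=False).fit(X, y)
weighted = LinearSVC(dual=False, class_weight="balanced").fit(X, y)

print("boundary, plain SVM   :", round(boundary(plain), 2))     # pushed toward the rare class
print("boundary, class_weight:", round(boundary(weighted), 2))  # back near 1.0
```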
When we use probabilistic models for binary classification, everything is fine: at training time, the model doesn't depend much on the imbalance, and at test time we can use imbalance-insensitive metrics such as ROC AUC, which depend on the predicted class probabilities rather than on "hard" discrete classification.
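To make "imbalance-insensitive" concrete, here is a sketch on synthetic data: the same predicted scores are evaluated on the original test set and on an artificially balanced one. ROC AUC barely moves, while accuracy at a fixed 0.5 cutoff swings with the class mix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)

# Synthetic imbalanced binary data (roughly 5% positives) with real signal.
n = 40_000
X = rng.normal(size=(n, 3))
p = 1.0 / (1.0 + np.exp(-(X @ np.array([1.0, -1.0, 0.5]) - 3.5)))
y = (rng.random(n) < p).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

# An artificially balanced version of the same test set.
pos = np.where(y_te == 1)[0]
neg = rng.choice(np.where(y_te == 0)[0], size=len(pos), replace=False)
bal = np.concatenate([pos, neg])

# AUC only compares the ranking of positives vs negatives, so the class mix
# hardly matters; accuracy at a fixed cutoff depends on it heavily.
print("ROC AUC, imbalanced test set:", round(roc_auc_score(y_te, scores), 3))
print("ROC AUC, balanced test set  :", round(roc_auc_score(y_te[bal], scores[bal]), 3))
print("accuracy@0.5, imbalanced    :", round(accuracy_score(y_te, (scores > 0.5).astype(int)), 3))
print("accuracy@0.5, balanced      :", round(accuracy_score(y_te[bal], (scores[bal] > 0.5).astype(int)), 3))
```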
However, these metrics do not generalize easily to multiclass classification, where we usually fall back on plain accuracy. And accuracy has known issues with class imbalance: it is based on hard classification and may completely ignore the rare classes. This is the point where most practitioners turn to oversampling. However, if you stick to probabilistic prediction and measure performance with log loss (aka cross-entropy), you can still survive class imbalance.
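A hedged sketch of that point on synthetic three-class data (the class priors and the weak signal are assumptions chosen for illustration): when one class dominates, hard predictions rarely leave the majority class, so accuracy hardly separates an informative model from a prior-only baseline, while log loss does.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Three classes with priors around 90% / 7% / 3% and a weak but real signal.
n = 30_000
X = rng.normal(size=(n, 5))
logits = 0.5 * (X @ rng.normal(size=(5, 3))) + np.log([0.90, 0.07, 0.03])
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
# Sample a class label per row from its probability vector.
y = (rng.random(n)[:, None] > probs.cumsum(axis=1)[:, :2]).sum(axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="prior").fit(X_tr, y_tr)   # ignores the features
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

for name, clf in [("prior only", baseline), ("logistic  ", model)]:
    print(name,
          "| accuracy:", round(accuracy_score(y_te, clf.predict(X_te)), 3),
          "| log loss:", round(log_loss(y_te, clf.predict_proba(X_te), labels=[0, 1, 2]), 3))
# Accuracy moves only slightly (the rare classes are almost never predicted),
# while log loss clearly rewards the model that actually uses the features.
```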
- Over-sampling is good when you don't want probabilistic classification. In this case, "distribution of the classes" is kind of irrelevant.
Imagine an application in which you don't need the probability that there is a cat in the picture; you just want to know whether the image is more similar to images of cats than to images of dogs. In this setting, it may be desirable that cats and dogs get an equal number of "votes", even if cats were the majority in the original training sample.
In other applications (such as fraud detection, click prediction, or my favorite, credit scoring), what you really need is not a "hard" classification but a ranking: which customers are more likely to cheat, click, or default than the others? In this case, it does not matter much whether the sample is imbalanced, because the cutoff is usually set by hand (from economic considerations, such as cost analysis). However, in such applications it may also be helpful to predict the "true" probability of fraud (or click, or default), and then upsampling is undesirable.
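A sketch of that trade-off on synthetic "default" data (the data and model are illustrative assumptions; the adjustment at the end is the standard prior correction that shifts the log-odds by the difference between the true and the training prior log-odds): training on a down-sampled set leaves the ranking intact but inflates the probabilities, which can then be shifted back.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)

# Synthetic credit-scoring-like data: about 3% of customers default.
n = 50_000
X = rng.normal(size=(n, 1))
p = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - 4.0)))
y = (rng.random(n) < p).astype(int)

# Train on a 1:1 down-sampled version of the data.
pos = np.where(y == 1)[0]
neg = rng.choice(np.where(y == 0)[0], size=len(pos), replace=False)
idx = np.concatenate([pos, neg])
model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
p_hat = model.predict_proba(X)[:, 1]

# The ranking is essentially untouched by the resampling...
print("ROC AUC on the original data:", round(roc_auc_score(y, p_hat), 3))

# ...but the probabilities are inflated; shift the log-odds back by the
# difference between the true and the training prior log-odds.
pi_true, pi_train = y.mean(), y[idx].mean()
shift = np.log(pi_true / (1 - pi_true)) - np.log(pi_train / (1 - pi_train))
p_corrected = 1.0 / (1.0 + np.exp(-(np.log(p_hat / (1 - p_hat)) + shift)))

print("mean predicted default rate, resampled model :", round(p_hat.mean(), 3))
print("mean predicted default rate, after correction:", round(p_corrected.mean(), 3))
print("actual default rate                          :", round(y.mean(), 3))
```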