TL;DR: the "curse" of class imbalance is kind of a myth, relevant only for certain types of problems.
- Not all ML models fail in an imbalanced-class setting. Most probabilistic models are not seriously affected by class imbalance. Problems usually arise when we switch to non-probabilistic or multiclass prediction.
In logistic regression (and its generalization, neural networks), class imbalance strongly affects the intercept but has very little influence on the slope coefficients. Intuitively, the predicted log-odds log(p(y=1|x)/p(y=0|x)) = a + x*b of binary logistic regression changes by a fixed amount when we change the prior probabilities of the classes, and this shift is absorbed almost entirely by the intercept a.
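Here is a minimal sketch of that effect on synthetic data (everything below is just an illustration, using scikit-learn with regularization effectively turned off): down-sampling the majority class leaves the slopes almost unchanged and shifts the intercept by roughly the log of the resampling ratio.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic imbalanced data: positives are roughly 10% of the sample.
n = 20_000
X = rng.normal(size=(n, 3))
true_beta = np.array([1.0, -2.0, 0.5])
p = 1.0 / (1.0 + np.exp(-(X @ true_beta - 3.0)))   # large negative intercept -> rare positives
y = (rng.random(n) < p).astype(int)

# Fit on the original data and on a 1:1 down-sampled version
# (C is large, so regularization does not blur the comparison).
full = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)

pos = np.where(y == 1)[0]
neg = rng.choice(np.where(y == 0)[0], size=len(pos), replace=False)
idx = np.concatenate([pos, neg])
balanced = LogisticRegression(C=1e6, max_iter=1000).fit(X[idx], y[idx])

print("slopes, full data     :", full.coef_.round(2))
print("slopes, balanced data :", balanced.coef_.round(2))        # nearly identical
print("intercept shift       :", round(balanced.intercept_[0] - full.intercept_[0], 2))
print("log(n_neg / n_pos)    :", round(np.log((y == 0).sum() / (y == 1).sum()), 2))
```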
In decision trees (and their generalizations, random forests and gradient boosted trees), class imbalance affects the leaf impurity metrics, but this effect is roughly the same for all candidate splits, so it usually does not change which splits are chosen.
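A quick sanity check on synthetic data (an illustrative sketch, not a proof): fit a shallow tree on the imbalanced sample and on a 1:1 down-sampled one, and compare where the splits go.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Synthetic imbalanced data where only feature 0 carries signal.
n = 20_000
X = rng.normal(size=(n, 4))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - 3.0)))).astype(int)

# 1:1 down-sample of the majority class.
pos = np.where(y == 1)[0]
neg = rng.choice(np.where(y == 0)[0], size=len(pos), replace=False)
idx = np.concatenate([pos, neg])

tree_full = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_bal = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[idx], y[idx])

# Both trees put essentially all of their importance on feature 0: the impurity
# penalty from the imbalance is shared by every candidate split, so the choice
# of splits barely changes (individual thresholds may move a little).
print("importances, full data    :", tree_full.feature_importances_.round(2))
print("importances, balanced data:", tree_bal.feature_importances_.round(2))
```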
On the other hand, non-probabilistic models such as SVM can be seriously affected by class imbalance. An SVM learns its separating hyperplane in such a way that roughly the same number of positive and negative examples (the support vectors) lie on the margin or on its wrong side. Therefore, resampling can dramatically change these counts and hence the position of the boundary.
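A rough illustration on synthetic 1-D data: with a 20:1 class ratio, a plain linear SVM pushes the boundary toward the rare class, and reweighting it with scikit-learn's class_weight="balanced" (which plays the same role as resampling) moves it back. This is only a sketch under these synthetic assumptions, not a general statement about where the boundary ends up.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)

# Two overlapping 1-D Gaussian classes, 20:1 imbalance; the midpoint between
# the class means is x = 1.0.
n_neg, n_pos = 20_000, 1_000
X = np.concatenate([rng.normal(0.0, 1.0, n_neg),
                    rng.normal(2.0, 1.0, n_pos)]).reshape(-1, 1)
y = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])

def boundary(clf):
    # The point where the decision function crosses zero.
    return -clf.intercept_[0] / clf.coef_[0, 0]

plain = LinearSVC(dual=False).fit(X, y)
weighted = LinearSVC(dual=False, class_weight="balanced").fit(X, y)

print("boundary, plain SVM   :", round(boundary(plain), 2))     # pushed toward the rare class
print("boundary, class_weight:", round(boundary(weighted), 2))  # back near 1.0
```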
When we use probabilistic models for binary classification, everything is fine: at training time, the model doesn't depend much on the imbalance, and at test time we can use imbalance-insensitive metrics such as ROC AUC, which depend on the predicted class probabilities rather than on "hard" discrete classification.
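To make "imbalance-insensitive" concrete, here is a sketch on synthetic data: the same predicted scores are evaluated on the original test set and on an artificially balanced one. ROC AUC barely moves, while accuracy at a fixed 0.5 cutoff swings with the class mix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)

# Synthetic imbalanced binary data (roughly 5% positives) with real signal.
n = 40_000
X = rng.normal(size=(n, 3))
p = 1.0 / (1.0 + np.exp(-(X @ np.array([1.0, -1.0, 0.5]) - 3.5)))
y = (rng.random(n) < p).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

# An artificially balanced version of the same test set.
pos = np.where(y_te == 1)[0]
neg = rng.choice(np.where(y_te == 0)[0], size=len(pos), replace=False)
bal = np.concatenate([pos, neg])

# AUC only compares the ranking of positives vs negatives, so the class mix
# hardly matters; accuracy at a fixed cutoff depends on it heavily.
print("ROC AUC, imbalanced test set:", round(roc_auc_score(y_te, scores), 3))
print("ROC AUC, balanced test set  :", round(roc_auc_score(y_te[bal], scores[bal]), 3))
print("accuracy@0.5, imbalanced    :", round(accuracy_score(y_te, (scores > 0.5).astype(int)), 3))
print("accuracy@0.5, balanced      :", round(accuracy_score(y_te[bal], (scores[bal] > 0.5).astype(int)), 3))
```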
However, these metrics do not generalize easily to multiclass classification, where we usually fall back on plain accuracy. And accuracy has known issues with class imbalance: it is based on hard classification and may completely ignore the rare classes. This is the point where most practitioners turn to oversampling. However, if you stick to probabilistic prediction and measure performance with log loss (aka cross-entropy), you can still survive class imbalance.
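A hedged sketch of that point on synthetic three-class data (the class priors and the weak signal are assumptions chosen for illustration): when one class dominates, hard predictions rarely leave the majority class, so accuracy hardly separates an informative model from a prior-only baseline, while log loss does.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Three classes with priors around 90% / 7% / 3% and a weak but real signal.
n = 30_000
X = rng.normal(size=(n, 5))
logits = 0.5 * (X @ rng.normal(size=(5, 3))) + np.log([0.90, 0.07, 0.03])
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
# Sample a class label per row from its probability vector.
y = (rng.random(n)[:, None] > probs.cumsum(axis=1)[:, :2]).sum(axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="prior").fit(X_tr, y_tr)   # ignores the features
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

for name, clf in [("prior only", baseline), ("logistic  ", model)]:
    print(name,
          "| accuracy:", round(accuracy_score(y_te, clf.predict(X_te)), 3),
          "| log loss:", round(log_loss(y_te, clf.predict_proba(X_te), labels=[0, 1, 2]), 3))
# Accuracy moves only slightly (the rare classes are almost never predicted),
# while log loss clearly rewards the model that actually uses the features.
```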
- Over-sampling is good when you don't want probabilistic classification. In this case, "distribution of the classes" is kind of irrelevant.
Imagine an application in which you don't need the probability that there is a cat in the picture; you just want to know whether the image is more similar to images of cats than to images of dogs. In this setting, it may be desirable that cats and dogs get an equal number of "votes", even if cats were the majority in the original training sample.
In other applications (such as fraud detection, click prediction, or my favorite, credit scoring), what you really need is not a "hard" classification but a ranking: which customers are more likely to cheat, click, or default than the others? In this case, it does not matter much whether the sample is imbalanced, because the cutoff is usually set by hand (from economic considerations, such as cost analysis). However, in such applications it may also be helpful to predict the "true" probability of fraud (or click, or default), and then upsampling is undesirable.
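A sketch of that trade-off on synthetic "default" data (the data and model are illustrative assumptions; the adjustment at the end is the standard prior correction that shifts the log-odds by the difference between the true and the training prior log-odds): training on a down-sampled set leaves the ranking intact but inflates the probabilities, which can then be shifted back.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)

# Synthetic credit-scoring-like data: about 3% of customers default.
n = 50_000
X = rng.normal(size=(n, 1))
p = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - 4.0)))
y = (rng.random(n) < p).astype(int)

# Train on a 1:1 down-sampled version of the data.
pos = np.where(y == 1)[0]
neg = rng.choice(np.where(y == 0)[0], size=len(pos), replace=False)
idx = np.concatenate([pos, neg])
model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
p_hat = model.predict_proba(X)[:, 1]

# The ranking is essentially untouched by the resampling...
print("ROC AUC on the original data:", round(roc_auc_score(y, p_hat), 3))

# ...but the probabilities are inflated; shift the log-odds back by the
# difference between the true and the training prior log-odds.
pi_true, pi_train = y.mean(), y[idx].mean()
shift = np.log(pi_true / (1 - pi_true)) - np.log(pi_train / (1 - pi_train))
p_corrected = 1.0 / (1.0 + np.exp(-(np.log(p_hat / (1 - p_hat)) + shift)))

print("mean predicted default rate, resampled model :", round(p_hat.mean(), 3))
print("mean predicted default rate, after correction:", round(p_corrected.mean(), 3))
print("actual default rate                          :", round(y.mean(), 3))
```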