21
votes

I have a class imbalance problem and want to solve it using cost-sensitive learning.

  1. undersample and oversample
  2. give weights to the classes so that a modified loss function is used

Question

Scikit-learn has two options called class weights and sample weights. Is sample weight actually doing option 2) and class weight option 1)? Is option 2) the recommended way of handling class imbalance?

2
I think that this largely depends upon the problem. Can you provide some more information about how imbalanced your data is, and perhaps what you've already tried? - NBartley

2 Answers

15
votes

The concepts are similar: with sample_weight you can force the estimator to pay more attention to particular samples, and with class_weight you can force it to pay more attention to a particular class. A sample_weight or class_weight of 0 basically means that the estimator does not take those samples or classes into account during learning at all. Thus a classifier, for example, will never predict a class whose class_weight is 0. If the sample_weight/class_weight of some samples/classes is larger than that of the others, the estimator will prioritize minimizing the error on those samples/classes. You can use user-defined sample_weight and class_weight simultaneously.
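As a minimal sketch of the two parameters, assuming a synthetic dataset and an arbitrary weight of 9 for the minority class (both are illustrative choices, not recommendations): class_weight is set on the estimator, while sample_weight is passed to fit().

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data: roughly 90% class 0 and 10% class 1 (illustrative only).
X, y = make_classification(n_samples=1000, n_features=5,
                           weights=[0.9, 0.1], random_state=0)

# Baseline without any weighting.
plain = LogisticRegression(max_iter=1000).fit(X, y)

# class_weight: one weight per class, given to the estimator.
clf_cw = LogisticRegression(class_weight={0: 1.0, 1: 9.0}, max_iter=1000).fit(X, y)

# sample_weight: one weight per training sample, given to fit().
w = np.where(y == 1, 9.0, 1.0)
clf_sw = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)

# Count how often each model predicts the minority class on the training data.
n_pos_plain = int((plain.predict(X) == 1).sum())
n_pos_weighted = int((clf_cw.predict(X) == 1).sum())
print(n_pos_plain, n_pos_weighted)
```

Upweighting the minority class should make the classifier predict it more often, usually at the cost of more false positives.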

If you undersample or oversample your training set by simply removing or cloning samples, this is roughly equivalent to decreasing or increasing the corresponding sample_weight/class_weight.
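A small sketch of that equivalence, assuming a logistic regression and a hypothetical oversampling factor k = 3: cloning every minority row so that it appears k times should give nearly the same fitted coefficients as leaving the data alone and giving those rows a weight of k. The tight tol is only there so that optimizer noise does not hide the agreement.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5,
                           weights=[0.9, 0.1], random_state=0)

k = 3  # hypothetical oversampling factor for the minority class

# Option A: oversample by cloning, so each minority row appears k times in total.
minority = np.flatnonzero(y == 1)
idx = np.concatenate([np.arange(len(y))] + [minority] * (k - 1))
X_over, y_over = X[idx], y[idx]
clf_over = LogisticRegression(tol=1e-8, max_iter=5000).fit(X_over, y_over)

# Option B: keep the data unchanged and give each minority row weight k.
w = np.where(y == 1, float(k), 1.0)
clf_w = LogisticRegression(tol=1e-8, max_iter=5000).fit(X, y, sample_weight=w)

# Largest absolute difference between the two coefficient vectors.
max_diff = np.abs(clf_over.coef_ - clf_w.coef_).max()
print(max_diff)
```

The match relies on the estimator using every row deterministically; estimators that subsample rows at random can behave differently on cloned data.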

In more complex cases you can also try artificially generating samples with techniques like SMOTE.
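To give an idea of what such a technique does, here is a simplified SMOTE-style sketch: each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The function name smote_like and its parameters are hypothetical, and this is not the implementation from any particular library; for real use a maintained implementation is probably the better choice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point comes back as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh = nn.kneighbors(X_min)
    # Pick random minority points and, for each, one of its k neighbours (columns 1..k).
    base = rng.integers(0, len(X_min), size=n_new)
    partner = neigh[base, rng.integers(1, k + 1, size=n_new)]
    # Interpolation factor in [0, 1) for every synthetic point.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[partner] - X_min[base])

# Toy minority-class samples (illustrative only).
rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))
X_new = smote_like(X_min, n_new=30)
print(X_new.shape)
```

The synthetic rows would then be appended to the training set with the minority label before fitting.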

8
votes

sample_weight and class_weight have a similar function, that is to make your estimator pay more attention to some samples.

The effective weight of each sample is its sample_weight multiplied by the class_weight of its class.
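A quick way to check the multiplication, sketched with a logistic regression and made-up weights (the per-sample weights s and the class weights below are arbitrary): fitting with both class_weight and sample_weight should give the same model as fitting with a single sample_weight that is already the product of the two.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5,
                           weights=[0.9, 0.1], random_state=0)

rng = np.random.default_rng(0)
s = rng.uniform(0.5, 1.5, size=len(y))   # arbitrary per-sample weights
cw = {0: 1.0, 1: 3.0}                    # arbitrary per-class weights

# Product of sample weight and the class weight of each sample's class,
# computed up front so it does not depend on what fit() does with s.
combined = s * np.where(y == 1, cw[1], cw[0])

# Fit 1: both parameters at once.
clf_both = LogisticRegression(class_weight=cw, max_iter=1000).fit(X, y, sample_weight=s)

# Fit 2: only sample_weight, already multiplied by the class weights.
clf_prod = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=combined)

same = np.allclose(clf_both.coef_, clf_prod.coef_, atol=1e-4)
print(same)
```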

This serves the same purpose as under/oversampling but the behavior is likely to be different: say you have an algorithm that randomly picks samples (like in random forests), it matters whether you oversampled or not.

To sum it up:
class_weight and sample_weight both do 2), and option 2) is one way to handle class imbalance. I don't know of a universally recommended way; I would try 1), 2) and 1) + 2) on your specific problem to see what works best.