Undersampling vs class_weight in ScikitLearn Random Forests

Question

I am applying ScikitLearn's random forests on an extremely unbalanced dataset (ratio of 1:10 000). I can use the class_weight='balanced' parameter. I have read that it is equivalent to undersampling.

However, this method seems to apply weights to the samples and does not change the actual number of samples.
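
To make the weighting concrete: scikit-learn documents the 'balanced' mode as setting each class weight to n_samples / (n_classes * count of that class). The sketch below reproduces that formula in plain Python on a made-up label vector. The 1:10 000 counts are illustrative, and `balanced_weights` is a hypothetical helper written for this example, not a library function.

```python
from collections import Counter

def balanced_weights(y):
    """Per-class weights following n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}

# Illustrative labels: 10 000 majority rows (0) and 1 minority row (1).
y = [0] * 10_000 + [1]
weights = balanced_weights(y)
print(weights)
```

The number of rows stays the same; only the weight each row carries is rescaled, so the single minority row counts roughly as much as all majority rows combined.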

Because each tree of the Random Forest is built on a randomly drawn subsample of the training set, I am afraid the minority class will not be sufficiently represented (or not represented at all) in each subsample. Is this true? This would lead to very biased trees.
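
A rough way to quantify this worry, assuming each tree draws a bootstrap sample of n rows with replacement from the n training rows (scikit-learn's default when bootstrap=True and max_samples is not set): the chance that a given tree sees none of the k minority rows is (1 - k/n)^n, which is close to e^-k. The sizes below are illustrative and not taken from the dataset in question; `prob_no_minority` is a hypothetical helper.

```python
import math

def prob_no_minority(n_total, n_minority):
    """Probability that a bootstrap sample of n_total rows, drawn with
    replacement from n_total rows, contains zero minority rows."""
    return (1.0 - n_minority / n_total) ** n_total

# Illustrative sizes at a 1:10 000 ratio: k minority rows, 10 000 * k majority rows.
for k in (1, 5, 20):
    n = k * 10_001
    p = prob_no_minority(n, k)
    print(f"k={k:>2}  P(tree sees no minority row) = {p:.2e}  (e^-k = {math.exp(-k):.2e})")
```

Under that assumption, a tree rarely misses the minority class entirely once there are more than a handful of minority rows, but each tree still sees only about k minority draws among roughly n majority draws.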

Thus, my question is: does the class_weight="balanced" parameter allow building reasonably unbiased Random Forest models on extremely unbalanced datasets, or should I find a way to undersample the majority class for each tree or when building the training set?

Having a class without much representation is a danger in itself. You want enough examples of the minority class to be representative. That doesn't mean there's a benefit to under-sampling the majority class. — Arya McCarthy

Albgold Albgold · Accepted Answer · 2017-04-19T22:51:44

I think you can split the majority class into roughly 10,000 subsets and train the same model on each subset together with the same minority-class points.
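
A minimal sketch of that idea, assuming NumPy arrays and scikit-learn's RandomForestClassifier. The helper names, the synthetic data, the chunk count of 10, and the choice to average the minority-class probabilities across models are all assumptions made for illustration; the answer itself does not say how the per-subset models should be combined.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_chunked_forests(X, y, minority_label=1, n_chunks=10, seed=0):
    """Train one forest per majority-class chunk, reusing all minority rows."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    # Shuffle the majority rows so each chunk is a random subset.
    majority_idx = rng.permutation(np.flatnonzero(y != minority_label))
    models = []
    for chunk in np.array_split(majority_idx, n_chunks):
        idx = np.concatenate([chunk, minority_idx])
        model = RandomForestClassifier(n_estimators=50, random_state=seed)
        model.fit(X[idx], y[idx])
        models.append(model)
    return models

def predict_proba_minority(models, X, minority_label=1):
    """Average the minority-class probability over all models (an assumption)."""
    probs = []
    for m in models:
        col = list(m.classes_).index(minority_label)
        probs.append(m.predict_proba(X)[:, col])
    return np.mean(probs, axis=0)

# Synthetic, illustrative data: 2000 majority rows and 20 minority rows.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(2000, 2)),
               rng.normal(3.0, 1.0, size=(20, 2))])
y = np.array([0] * 2000 + [1] * 20)

models = fit_chunked_forests(X, y, n_chunks=10)
scores = predict_proba_minority(models, X)
print(len(models), scores.shape)
```

Each model here sees 200 majority rows and the same 20 minority rows, so every training set is far less skewed than the full data, at the cost of training many models.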
