12 votes

I'm wondering if there is an implementation of the Balanced Random Forest (BRF) in recent versions of the scikit-learn package. BRF is used in the case of imbalanced data. It works like a normal RF, but for each bootstrap iteration it balances the classes by undersampling. For example, given two classes with N0 = 100 and N1 = 30 instances, at each random sampling it draws (with replacement) 30 instances from the first class and the same number of instances from the second class, i.e. it trains each tree on a balanced data set. For more information please refer to this paper.
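To make the sampling scheme concrete, here is a minimal sketch of a single balanced bootstrap draw (the arrays X and y and the class counts are illustrative, not from any library):

import numpy as np

rng = np.random.default_rng(0)
idx_major = np.flatnonzero(y == 0)  # majority class, e.g. 100 instances
idx_minor = np.flatnonzero(y == 1)  # minority class, e.g. 30 instances
n = len(idx_minor)
# Draw n instances with replacement from each class, so the tree
# is trained on a balanced set of 2 * n instances.
boot = np.concatenate([rng.choice(idx_major, size=n, replace=True),
                       rng.choice(idx_minor, size=n, replace=True)])
X_boot, y_boot = X[boot], y[boot]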

RandomForestClassifier() does have the class_weight parameter, which can be set to 'balanced', but I'm not sure whether that is related to downsampling of the bootstrapped training samples.

We're working on it. imblearn is a good solution for now. – Andreas Mueller

2 Answers

12 votes

What you're looking for is the BalancedBaggingClassifier from imblearn.

imblearn.ensemble.BalancedBaggingClassifier(base_estimator=None,
 n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True,
 bootstrap_features=False, oob_score=False, warm_start=False, ratio='auto',
 replacement=False, n_jobs=1, random_state=None, verbose=0)

Effectively, what it allows you to do is successively undersample your majority class while fitting an estimator on top. You can use a random forest or any other base estimator from scikit-learn. Here is an example.
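A minimal sketch along those lines (the toy data is illustrative; also note that in recent imblearn releases base_estimator was renamed estimator and ratio was renamed sampling_strategy, so adjust the keyword names to your version):

from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Imbalanced toy data: roughly 100 majority vs 30 minority instances.
X, y = make_classification(n_samples=130, weights=[0.77],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Each bagged estimator sees a bootstrap sample in which the majority
# class has been undersampled to the size of the minority class.
bbc = BalancedBaggingClassifier(
    base_estimator=RandomForestClassifier(n_estimators=10),
    n_estimators=10,
    random_state=0)
bbc.fit(X_train, y_train)
print(bbc.score(X_test, y_test))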

2 votes

There is now a class in imblearn called BalancedRandomForestClassifier. It works similarly to the previously mentioned BalancedBaggingClassifier, but is tailored specifically to random forests.

from imblearn.ensemble import BalancedRandomForestClassifier

# Each tree is fitted on a bootstrap sample in which the majority
# class is randomly undersampled to match the minority class.
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
brf.fit(X_train, y_train)
y_pred = brf.predict(X_test)