12 votes

I'm wondering if there is an implementation of the Balanced Random Forest (BRF) in recent versions of the scikit-learn package. BRF is used in the case of imbalanced data. It works like a normal RF, but for each bootstrap iteration it balances the classes by undersampling. For example, given two classes with N0 = 100 and N1 = 30 instances, at each random sampling it draws (with replacement) 30 instances from the first class and the same number of instances from the second class, i.e. it trains each tree on a balanced data set. For more information please refer to this paper.
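To make the sampling scheme concrete, here is a minimal sketch of a single balanced bootstrap draw (the arrays X and y and the class counts are illustrative, not from any library):

import numpy as np

rng = np.random.default_rng(0)
idx_major = np.flatnonzero(y == 0)  # majority class, e.g. 100 instances
idx_minor = np.flatnonzero(y == 1)  # minority class, e.g. 30 instances
n = len(idx_minor)
# Draw n instances with replacement from each class, so the tree
# is trained on a balanced set of 2 * n instances.
boot = np.concatenate([rng.choice(idx_major, size=n, replace=True),
                       rng.choice(idx_minor, size=n, replace=True)])
X_boot, y_boot = X[boot], y[boot]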

RandomForestClassifier() does have the class_weight parameter, which can be set to 'balanced', but I'm not sure whether that is related to downsampling of the bootstrapped training samples.

We're working on it. imblearn is a good solution for now. – Andreas Mueller

2 Answers

12 votes

What you're looking for is the BalancedBaggingClassifier from imblearn.

imblearn.ensemble.BalancedBaggingClassifier(base_estimator=None,
 n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True,
 bootstrap_features=False, oob_score=False, warm_start=False, ratio='auto',
 replacement=False, n_jobs=1, random_state=None, verbose=0)

Effectively, what it allows you to do is successively undersample your majority class while fitting an estimator on top. You can use a random forest or any other base estimator from scikit-learn. Here is an example.
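A minimal sketch along those lines (the toy data is illustrative; also note that in recent imblearn releases base_estimator was renamed estimator and ratio was renamed sampling_strategy, so adjust the keyword names to your version):

from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Imbalanced toy data: roughly 100 majority vs 30 minority instances.
X, y = make_classification(n_samples=130, weights=[0.77],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Each bagged estimator sees a bootstrap sample in which the majority
# class has been undersampled to the size of the minority class.
bbc = BalancedBaggingClassifier(
    base_estimator=RandomForestClassifier(n_estimators=10),
    n_estimators=10,
    random_state=0)
bbc.fit(X_train, y_train)
print(bbc.score(X_test, y_test))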

2 votes

There is now a class in imblearn called BalancedRandomForestClassifier. It works similarly to the previously mentioned BalancedBaggingClassifier, but is tailored specifically to random forests.

from imblearn.ensemble import BalancedRandomForestClassifier

# Each tree is fitted on a bootstrap sample in which the majority
# class is randomly undersampled to match the minority class.
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
brf.fit(X_train, y_train)
y_pred = brf.predict(X_test)