6
votes

Perhaps this is too long-winded. Simple question about sklearn's random forest:

For a true/false classification problem, is there a way in sklearn's random forest to specify the sample size used to train each tree, along with the ratio of true to false observations?

More details are below:


R's random forest implementation, the randomForest package, has a sampsize argument. It lets you balance the sample used to train each tree by outcome class.

For example, if you're trying to predict whether an outcome is true or false and 90% of the outcomes in the training set are false, you can set sampsize = c(500, 500). Each tree is then trained on a random sample (with replacement) from the training set containing 500 true and 500 false observations. In these situations, I've found that models predict true outcomes much better at a 50% cut-off, yielding much higher kappas.

It doesn't seem like there is an option for this in the sklearn implementation.

  • Is there any way to mimic this functionality in sklearn?
  • Would simply optimizing the cut-off based on the Kappa statistic achieve a similar result or is something lost in this approach?

4 Answers

3
votes

As of version 0.16-dev, you can pass class_weight="auto" to get something close to what you want. This still uses all samples, but reweights them so that the classes are balanced.
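A minimal sketch of the reweighting approach, on synthetic data with roughly 90% negatives. One assumption to flag: to the best of my knowledge, releases after 0.16 spell this option "balanced" (with a per-tree variant "balanced_subsample") rather than "auto", so the sketch uses "balanced". The dataset and parameter values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced data: about 90% of labels are 0 (false).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# class_weight="balanced" reweights samples inversely to class frequency.
# Every sample is still used; no per-tree sample size is set here.
clf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=0
)
clf.fit(X, y)

# Probability of the positive (true) class, for use with a chosen cut-off.
proba_true = clf.predict_proba(X)[:, 1]
```

Note that this changes how much each observation counts, not how many observations each tree sees, so it is not an exact equivalent of sampsize.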

2
votes

After reading over the documentation, I think the answer is definitely no. Kudos to anyone who adds the functionality, though. As mentioned above, the R package randomForest contains this functionality.

0
votes

As far as I am aware, the scikit-learn forests use bootstrapping, i.e. each tree is trained on a sample of the same size, drawn from the original training set by random sampling with replacement.

Assuming you have a large enough set of training samples, why not balance the training set itself so that it holds 50/50 positive/negative samples? That should achieve the desired effect. scikit-learn provides functionality for this.
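One way to sketch this balancing idea, taken a step further so that each tree gets its own balanced draw (closer to what sampsize does in R). This is not a built-in scikit-learn feature: balanced_sample, fit_balanced_trees and predict_proba_true are hypothetical helper names, and the ensemble is hand-rolled from DecisionTreeClassifier under the assumption of 0/1 labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def balanced_sample(X, y, n_per_class, seed):
    """Draw n_per_class rows from each class, with replacement."""
    rng = np.random.RandomState(seed)
    parts = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        parts.append(rng.choice(idx, size=n_per_class, replace=True))
    idx = np.concatenate(parts)
    return X[idx], y[idx]


def fit_balanced_trees(X, y, n_trees=25, n_per_class=500, seed=0):
    """Fit each tree on its own balanced bootstrap sample (hypothetical helper)."""
    trees = []
    for i in range(n_trees):
        Xb, yb = balanced_sample(X, y, n_per_class, seed + i)
        # max_features="sqrt" gives the per-split feature subsampling
        # that random forests typically use.
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed + i)
        trees.append(tree.fit(Xb, yb))
    return trees


def predict_proba_true(trees, X):
    """Average the trees' probabilities for class 1 (true)."""
    return np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)


# Synthetic data with about 90% false outcomes, as in the question.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
trees = fit_balanced_trees(X, y, n_trees=25, n_per_class=500)
proba_true = predict_proba_true(trees, X)
```

Calling balanced_sample once and passing the result to RandomForestClassifier would give the simpler one-off balancing described above; drawing a fresh balanced sample per tree lets every tree see a different subset of the majority class.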

0
votes

A workaround in R only: for classification, one can simply use all cores of the machine at 100% CPU utilization.

This matches the speed of the sklearn RandomForest classifier.

For regression, there is also a package RandomforestParallel on GitHub, which is much faster than the Python sklearn regressor.

I have tested the classification case, and it works well.