I have an extremely large dataset and would like to train several random forest models on partitions of the dataset, then average these models to come up with my final classifier. Since random forest is an ensemble method, this is intuitively a sound approach, but I'm not sure whether it's possible with scikit-learn's random forest classifier. Any ideas?

I'd also be open to using a random forest classifier from another package; I'm just not sure where to look.

Why not train the base learners (trees) on partitions of the data, then combine them into a single random forest? – FatihAkici
Nice, that approach is practically equivalent, given that multiple trees are trained on the same partition. Do you know of any references I could use to build this out? – hunter2

1 Answer


Here is what I can think of:

  1. Pandas + Scikit: You can write your own bootstrap routine: repeatedly read a reasonably sized random sample from the overall dataset and fit a scikit-learn decision tree to each sample (ideally randomizing the candidate features at each node, as a true random forest does). Pickle each tree, and finally average the trees' predictions to form your random forest (a sketch follows this list).

  2. GraphLab + SFrame: Turi has its own big-data library (SFrame, similar to Pandas, but disk-backed) and machine learning library (GraphLab Create, very similar to scikit-learn). A very nice environment to work in (sketch below).

  3. Blaze + Dask: this might have a slightly steeper learning curve for some people, but it would be an efficient solution (sketch below).

  4. Memory-mapped NumPy: you can also go with memory-mapped numpy arrays, but it's going to be more cumbersome than the first three options, and I've never done it myself, so I'll just leave the option here (still, a sketch of the idea is below).

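For option 1, something along these lines should work. This is only a minimal sketch, not a full implementation: the file name big_data.csv, the label being the last column, and the sample fraction are all placeholders, and the soft-vote combiner assumes every sample happens to contain all classes.

    import pickle
    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    N_TREES = 50      # size of the final ensemble
    KEEP_PROB = 0.01  # fraction of rows per bootstrap sample; tune to fit in RAM

    trees = []
    for i in range(N_TREES):
        # Read a random ~1% sample of the rows (row 0 is the header, keep it).
        sample = pd.read_csv(
            "big_data.csv",
            skiprows=lambda r: r > 0 and np.random.rand() > KEEP_PROB,
        )
        X, y = sample.iloc[:, :-1], sample.iloc[:, -1]
        # max_features="sqrt" randomizes the candidate features at each
        # split, which is what makes these "random forest" trees.
        tree = DecisionTreeClassifier(max_features="sqrt").fit(X, y)
        trees.append(tree)
        with open("tree_{}.pkl".format(i), "wb") as f:  # persist each tree
            pickle.dump(tree, f)

    def forest_predict(trees, X):
        # "Average the models": mean of the per-tree class probabilities,
        # then pick the most probable class.
        proba = np.mean([t.predict_proba(X) for t in trees], axis=0)
        return trees[0].classes_[np.argmax(proba, axis=1)]
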
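For option 2, the whole pipeline stays out-of-core because SFrame is disk-backed. A sketch from memory of the GraphLab Create API; again, big_data.csv and the "label" target column are placeholders:

    import graphlab as gl

    # SFrame is disk-backed, so reading the CSV is not limited by RAM.
    data = gl.SFrame.read_csv("big_data.csv")
    train, test = data.random_split(0.8)

    # The random forest trains directly on the SFrame.
    model = gl.random_forest_classifier.create(train, target="label")
    print(model.evaluate(test))
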
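For option 3, Dask can do the partitioning for you: dask.dataframe splits the CSV into partitions, and you can fit one tree per partition and reuse the combiner from the option 1 sketch. The file name, "label" column, and block size below are placeholders.

    import dask.dataframe as dd
    from sklearn.tree import DecisionTreeClassifier

    # One partition per ~256 MB block of the file.
    ddf = dd.read_csv("big_data.csv", blocksize="256MB")

    trees = []
    for i in range(ddf.npartitions):
        part = ddf.get_partition(i).compute()  # materialize one partition in RAM
        tree = DecisionTreeClassifier(max_features="sqrt")
        tree.fit(part.drop(columns="label"), part["label"])
        trees.append(tree)
    # Combine with the same forest_predict() helper as in the option 1 sketch.
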
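And for completeness, the memory-mapped option might look like this. It assumes the data has already been dumped to raw float32/int32 binary files; the file names, shapes, and dtypes are placeholders.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    N_ROWS, N_FEATURES = 10000000, 40  # must be known in advance

    # A memmap behaves like an ndarray but pages data in from disk on demand.
    X = np.memmap("features.dat", dtype=np.float32, mode="r",
                  shape=(N_ROWS, N_FEATURES))
    y = np.memmap("labels.dat", dtype=np.int32, mode="r", shape=(N_ROWS,))

    # scikit-learn accepts memmaps directly, and with n_jobs=-1 joblib can
    # share the array between workers without copying it.
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
    clf.fit(X, y)
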
All in all, I would go with option 2.