9
votes

Main question: How do I combine different random forests in Python with scikit-learn?

I am currently using the randomForest package in R to generate random forest objects on Elastic MapReduce, in order to address a classification problem.

Since my input data is too large to fit in memory on one machine, I sample it into smaller data sets and generate a random forest object for each, containing a smaller set of trees. I then combine the different trees using a modified combine function to create a new random forest object. This object contains the feature importances and the final set of trees, but not the OOB errors or the votes of the trees.

While this works well in R, I want to do the same thing in Python using scikit-learn. I can create the different random forest objects, but I don't see any way to combine them into a new object. Can anyone point me to a function that can combine the forests? Is this possible with scikit-learn?

Here is a link to a question on how to do this in R: Combining random forests built with different training sets in R.

Edit: The resulting random forest object should contain the trees, which can be used for prediction, and also the feature importances.

Any help would be appreciated.

2
If the goal is prediction, then there is no need to combine the different models. You can make predictions with the separate models and then combine only the results. – DrDom
Agree with @DrDom, there are many ways to ensemble models. Details on how you want to do it are pretty important. – David
@DrDom I agree that if it were just predictions, I could combine the results. But I am interested not only in predictions but also in the variable importance of the features. – reddy
@reddy, variable importance is the average change in prediction error while the variable is shuffled. Thus the average importance across the separate models should be approximately equal to the variable importance of the ensemble of random forests. This holds provided the variable importance values were not previously scaled or otherwise modified. In any case, variable importance is not a fixed number, since its value depends on the random numbers used. UPD: if the number of trees differs between models, you need to take that into account when computing the average importance. – DrDom
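
A minimal sketch of what DrDom describes, assuming each forest is trained separately and that all of them saw the same classes (the data, chunking, and parameter choices below are illustrative): predictions are combined by averaging class probabilities, and importances by a tree-count-weighted average of feature_importances_.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)

# Pretend each chunk of rows lives on a different machine.
chunks = np.array_split(np.arange(len(X)), 3)
forests = [RandomForestClassifier(n_estimators=20, random_state=i).fit(X[idx], y[idx])
           for i, idx in enumerate(chunks)]

# Combine predictions: average the class probabilities of the separate models.
avg_proba = np.mean([f.predict_proba(X) for f in forests], axis=0)
y_pred = forests[0].classes_[avg_proba.argmax(axis=1)]

# Combine importances: weight each forest by its number of trees.
weights = [len(f.estimators_) for f in forests]
avg_importance = np.average([f.feature_importances_ for f in forests],
                            axis=0, weights=weights)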

2 Answers

8
votes

Sure, just aggregate all the trees; for instance, have a look at this snippet from pyrallel:

from copy import copy

def combine(all_ensembles):
    """Combine the sub-estimators of a group of ensembles

        >>> from sklearn.datasets import load_iris
        >>> from sklearn.ensemble import ExtraTreesClassifier
        >>> iris = load_iris()
        >>> X, y = iris.data, iris.target

        >>> all_ensembles = [ExtraTreesClassifier(n_estimators=4).fit(X, y)
        ...                  for i in range(3)]
        >>> big = combine(all_ensembles)
        >>> len(big.estimators_)
        12
        >>> big.n_estimators
        12
        >>> big.score(X, y)
        1.0

    """
    final_ensemble = copy(all_ensembles[0])
    final_ensemble.estimators_ = []

    for ensemble in all_ensembles:
        final_ensemble.estimators_ += ensemble.estimators_

    # Required in old versions of sklearn
    final_ensemble.n_estimators = len(final_ensemble.estimators_)

    return final_ensemble
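
The doctest above merges three forests trained on the same data. As a rough sketch of the question's workflow (the chunking and parameters here are illustrative, not from pyrallel), the same combine helper can be applied to sub-forests fitted on separate chunks:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Shuffle so every chunk contains all classes, then split into three "machines".
rng = np.random.RandomState(42)
chunks = np.array_split(rng.permutation(len(X)), 3)

sub_forests = [RandomForestClassifier(n_estimators=10, random_state=i).fit(X[idx], y[idx])
               for i, idx in enumerate(chunks)]

big_forest = combine(sub_forests)
print(len(big_forest.estimators_))      # 30 pooled trees
print(big_forest.feature_importances_)  # importances computed over all 30 trees

In recent scikit-learn versions feature_importances_ is derived from estimators_, so the merged forest exposes both the pooled trees and their combined importances, which is what the edit to the question asks for.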
2
votes

Based on your edit, it sounds like you're only asking how to extract the feature importances and look at the individual trees used in a random forest. If so, both of these are attributes of your random forest model, named "feature_importances_" and "estimators_" respectively. An example illustrating this can be found below:

>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import make_blobs
>>> X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)
>>> clf = RandomForestClassifier(n_estimators=5, max_depth=None, min_samples_split=1, random_state=0)
>>> clf.fit(X, y)
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            min_density=None, min_samples_leaf=1, min_samples_split=1,
            n_estimators=5, n_jobs=1, oob_score=False, random_state=0,
            verbose=0)
>>> clf.feature_importances_
array([ 0.09396245,  0.07052027,  0.09951226,  0.09095071,  0.08926362,
        0.112209  ,  0.09137607,  0.11771107,  0.11297425,  0.1215203 ])
>>> clf.estimators_
[DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features='auto', min_density=None,
            min_samples_leaf=1, min_samples_split=1,
            random_state=<mtrand.RandomState object at 0x2b6f62d9b408>,
            splitter='best'), DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features='auto', min_density=None,
            min_samples_leaf=1, min_samples_split=1,
            random_state=<mtrand.RandomState object at 0x2b6f62d9b3f0>,
            splitter='best'), DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features='auto', min_density=None,
            min_samples_leaf=1, min_samples_split=1,
            random_state=<mtrand.RandomState object at 0x2b6f62d9b420>,
            splitter='best'), DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features='auto', min_density=None,
            min_samples_leaf=1, min_samples_split=1,
            random_state=<mtrand.RandomState object at 0x2b6f62d9b438>,
            splitter='best'), DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features='auto', min_density=None,
            min_samples_leaf=1, min_samples_split=1,
            random_state=<mtrand.RandomState object at 0x2b6f62d9b450>,
            splitter='best')]