Feature importances - Bagging, scikit-learn

11

votes

For a project I am comparing a number of decision trees, using the regression algorithms (Random Forest, Extra Trees, Adaboost and Bagging) of scikit-learn. To compare and interpret them I use the feature importance , though for the bagging decision tree this does not look to be available.

My question: Does anybody know how to get the feature importances list for Bagging?

Greetings, Kornee

machine-learningscikit-learndecision-treefeature-selection

16

votes

Are you talking about BaggingClassifier? It can be used with many base estimators, so there is no feature importances implemented. There are model-independent methods for computing feature importances (see e.g. https://github.com/scikit-learn/scikit-learn/issues/8898), scikit-learn doesn't use them.

In case of decision trees as base estimators you can compute feature importances yourselves: it'd be just an average of tree.feature_importances_ among all trees in bagging.estimators_:

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = BaggingClassifier(DecisionTreeClassifier())
clf.fit(X, y)

feature_importances = np.mean([
    tree.feature_importances_ for tree in clf.estimators_
], axis=0)

RandomForestClassifer does the same computation internally.

2

votes

Extending what CharlesG posted, here's my solution for overloading the BaggingRegressor (same should work for BaggingClassifier).

class myBaggingRegressor(BaggingRegressor):
    def fit(self, X, y):
        fitd = super().fit(X, y)
        # need to pad features?
        if self.max_features == 1.0:
            # compute feature importances or coefficients
            if hasattr(fitd.estimators_[0], 'feature_importances_'):
                self.feature_importances_ =  np.mean([est.feature_importances_ for est in fitd.estimators_], axis=0)
            else:
                self.coef_ =  np.mean([est.coef_ for est in fitd.estimators_], axis=0)
                self.intercept_ =  np.mean([est.intercept_ for est in fitd.estimators_], axis=0)
        else:
            # need to process results into the right shape
            coefsImports = np.empty(shape=(self.n_features_, self.n_estimators), dtype=float)
            coefsImports.fill(np.nan)
            if hasattr(fitd.estimators_[0], 'feature_importances_'):
                # store the feature importances
                for idx, thisEstim in enumerate(fitd.estimators_):
                    coefsImports[fitd.estimators_features_[idx], idx] = thisEstim.feature_importances_
                # compute average
                self.feature_importances_ = np.nanmean(coefsImports, axis=1)
            else:
                # store the coefficients & intercepts
                self.intercept_ = 0
                for idx, thisEstim in enumerate(fitd.estimators_):
                    coefsImports[fitd.estimators_features_[idx], idx] = thisEstim.coefs_
                    self.intercept += thisEstim.intercept_
                # compute
                self.intercept /= self.n_estimators
                # average
                self.coefs_ = np.mean(coefsImports, axis=1)                
        return fitd

This correctly handles if max_features <> 1.0, though I suppose won't work exactly if bootstrap_features=True.

I suppose it's because sklearn has evolved a lot since 2017, but couldn't get it to work with the constructor, and it doesn't seem entirely necessary - the only reason to have that is to pre-specify the feature_importances_ attribute as None. However, it shouldn't even exist until fit() is called anyway.

1

votes

I encountered the same problem, and average feature importance was what I was interested in. Furthermore, I needed to have a feature_importance_ attribute exposed by (i.e. accessible from) the bagging classifier object. This was necessary to be used in another scikit-learn algorithm (i.e. RFE with an ROC_AUC scorer).

I chose to overload the BaggingClassifier, to gain a direct access to the mean feature_importance (or "coef_" parameter) of the base estimators.

Here is how to do so:

class BaggingClassifierCoefs(BaggingClassifier):
    def __init__(self,**kwargs):
        super().__init__(**kwargs)
        # add attribute of interest
        self.feature_importances_ = None
    def fit(self, X, y, sample_weight=None):
        # overload fit function to compute feature_importance
        fitted = self._fit(X, y, self.max_samples, sample_weight=sample_weight) # hidden fit function
        if hasattr(fitted.estimators_[0], 'feature_importances_'):
            self.feature_importances_ =  np.mean([tree.feature_importances_ for tree in fitted.estimators_], axis=0)
        else:
            self.feature_importances_ =  np.mean([tree.coef_ for tree in fitted.estimators_], axis=0)
    return(fitted)

Feature importances - Bagging, scikit-learn

3 Answers