24 votes

I have a logistic regression and a random forest, and I'd like to combine them (ensemble) for the final classification by averaging their predicted probabilities.

Is there a built-in way to do this in scikit-learn? Some way where I can use the ensemble of the two as a classifier itself? Or would I need to roll my own classifier?

You need to roll your own; there's no way to combine two arbitrary classifiers. – Matti Lyra
There are several ongoing PRs and open issues on the scikit-learn GitHub which are working towards ensemble meta-estimators. Unfortunately none of them have been merged. – Daniel
@user1507844 Could you take a stab at a similar question here? stackoverflow.com/questions/23645837/… – ekta

4 Answers

34 votes

NOTE: The scikit-learn VotingClassifier is probably the best way to do this now.


OLD ANSWER:

For what it's worth, I ended up doing it as follows:

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class EnsembleClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, classifiers=None):
        self.classifiers = classifiers

    def fit(self, X, y):
        # Fit each underlying classifier on the same training data
        for classifier in self.classifiers:
            classifier.fit(X, y)
        return self

    def predict_proba(self, X):
        # Average the class-probability estimates across all classifiers
        self.predictions_ = list()
        for classifier in self.classifiers:
            self.predictions_.append(classifier.predict_proba(X))
        return np.mean(self.predictions_, axis=0)
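
A hypothetical usage sketch (the choice of base models and the X_train/X_test variables are illustrative assumptions, not part of the original answer):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Average the predicted probabilities of a logistic regression and a random forest
ensemble = EnsembleClassifier(
    classifiers=[LogisticRegression(), RandomForestClassifier()])
ensemble.fit(X_train, y_train)
averaged = ensemble.predict_proba(X_test)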

4 votes

Facing the same problem, I used a majority voting method. Combining probabilities/scores arbitrarily is very problematic, because the performance of your different classifiers can differ (for example, an SVM with two different kernels, plus a random forest, plus another classifier trained on a different training set).

One possible method to "weigh" the different classifiers is to use their Jaccard score as a "weight". (But be warned: as I understand it, the different scores are not all made equal. I know that a gradient boosting classifier in my ensemble gives all its scores as 0.97, 0.98, 1.00 or 0.41/0, i.e. it's very overconfident.)
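
A rough sketch of that weighting idea; everything here (a held-out validation set X_val/y_val, a list of already-fitted classifiers, test data X_test) is an assumption for illustration, and the weighting scheme is the heuristic described above, not a scikit-learn feature:

import numpy as np
from sklearn.metrics import jaccard_score

# Assumed: `classifiers` are fitted models; X_val/y_val are held-out validation data
# Weight each classifier by its (macro-averaged) Jaccard score on the validation set
weights = [jaccard_score(y_val, clf.predict(X_val), average='macro')
           for clf in classifiers]

# Weighted average of the class-probability estimates on the test set
all_probas = np.array([clf.predict_proba(X_test) for clf in classifiers])
weighted_probas = np.average(all_probas, axis=0, weights=weights)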

4 votes

What about the sklearn.ensemble.VotingClassifier?

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier

Per the description:

The idea behind the voting classifier implementation is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing models in order to balance out their individual weaknesses.
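
For the original question (averaging a logistic regression and a random forest), soft voting does exactly that. A minimal sketch, assuming training/test splits X_train, y_train, X_test already exist:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# voting='soft' predicts from the average of the predicted class probabilities;
# an optional weights= argument turns this into a weighted average
eclf = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('rf', RandomForestClassifier())],
    voting='soft')
eclf.fit(X_train, y_train)
probabilities = eclf.predict_proba(X_test)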

2 votes

Now scikit-learn has a StackingClassifier, which can be used to stack multiple estimators.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Base estimators whose predictions are fed to the final estimator
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('lg', LogisticRegression())
]
clf = StackingClassifier(
    estimators=estimators, final_estimator=LogisticRegression()
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
clf.fit(X_train, y_train)
clf.predict_proba(X_test)
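
Note that by default StackingClassifier fits the final_estimator on out-of-fold predictions of the base estimators (5-fold cross-validation, controlled by the cv parameter), so the meta-learner never sees predictions the base models made on their own training data.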