4 votes

I have a rather limited data set upon which I am performing supervised-learning, multi-class text classification using scikit-learn. To alleviate the shortage of information slightly, I wanted to do the following:

  1. Extract n-grams from the content I want to classify, merge them with the unigrams of the content, and perform classification

  2. Implement (or use an existing implementation of) a vote-based ensemble classifier to improve classification accuracy. For example, both Multinomial Bayes and KNN seem to give good results for different classes: ideally I would combine these such that I get slightly better (and hopefully not worse) performance rather than the shoddy ~50% I am able to get using my limited data set.
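For step 1, scikit-learn's vectorizers can extract unigrams and n-grams together via the `ngram_range` parameter, so no manual merging is needed. A minimal sketch (the toy documents here are made up for illustration):

```python
# Unigrams and bigrams in a single vocabulary via ngram_range=(1, 2).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog barked"]

# ngram_range=(1, 2) extracts both unigrams and bigrams in one pass.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

# The vocabulary now contains both "cat" and "the cat", etc.
print(sorted(vectorizer.vocabulary_))
```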

While the first step is trivial, I cannot find much on how I would be able to do ensemble classification using scikit-learn. I've noted that scikit-learn has some entries on ensemble classes such as this one, but it doesn't seem to be quite what I'm looking for.

Does anyone know of a concrete example of doing this using scikit-learn?

1 – I don't think this can be done natively in scikit-learn. There are several ways of combining the output of several classifiers. If you post an example of expected input and output, somebody can help you with the implementation. – elyase

1 Answer

2 votes

I struggled with this question as well. After a lot of experimentation, I found that the best way to do ensemble classification in scikit-learn was to average the clf.predict_proba(X) values of each trained model. The averaged predictions performed better over the long term (runs of 50 or more) than any individual model.
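A minimal sketch of that averaging approach: train each model, average their predict_proba outputs, and take the argmax. The synthetic data and model choices below (GaussianNB plus KNN rather than the question's MultinomialNB, since the generated features can be negative) are illustrative assumptions:

```python
# Soft-voting ensemble: average class probabilities across models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class problem standing in for the real (limited) data set.
X, y = make_classification(n_samples=200, n_classes=3, n_informative=5,
                           random_state=0)

models = [GaussianNB(), KNeighborsClassifier(n_neighbors=5)]
for m in models:
    m.fit(X, y)

# Average the class-probability estimates of each trained model; the
# columns line up because both models were fit on the same label set.
avg_proba = np.mean([m.predict_proba(X) for m in models], axis=0)
ensemble_pred = avg_proba.argmax(axis=1)
```

Because each model's probability rows sum to 1, the averaged rows do too, so the argmax is a well-defined soft vote.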

If you can guarantee that some of your trained models are stronger than others, you may also want to look at using weighted averages or a multi-armed-bandit ensemble approach.
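The weighted variant is a one-line change: weight each model's probabilities before summing. The probability arrays and weights below are made-up numbers, assuming the weights come from something like each model's validation accuracy:

```python
# Weighted soft vote over per-model probability arrays.
import numpy as np

probas = [np.array([[0.8, 0.2], [0.3, 0.7]]),   # stronger model
          np.array([[0.6, 0.4], [0.5, 0.5]])]   # weaker model
weights = np.array([0.7, 0.3])                  # assumed validation-based weights

# Weighted sum over the model axis: 0.7 * probas[0] + 0.3 * probas[1].
weighted = np.tensordot(weights, np.stack(probas), axes=1)
pred = weighted.argmax(axis=1)
```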

http://en.wikipedia.org/wiki/Multi-armed_bandit