
I am trying to use OneVsRestClassifier to do multilabel classification on a set of comments. My objective is to tag each comment with a list of possible topics. My custom classifier uses a manually curated list of words and their corresponding tags, stored in a CSV, to tag each comment. I am trying to combine the results of the bag-of-words technique and my custom classifier using VotingClassifier. Here is part of my existing code:

import numpy as np

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

class CustomClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, word_to_tag):
        self.word_to_tag = word_to_tag

    def fit(self, X, y=None):
        return self

    def predict_proba(self, X):
        # One row per comment in X (not per dictionary entry)
        prob = np.zeros(shape=(len(X), 2))

        for index, comment in np.ndenumerate(X):
            prob[index] = [0.5, 0.5]
            for word, label in self.word_to_tag.iteritems():
                if (label == self.class_label) and (comment.find(word) >= 0):
                    prob[index] = [0, 1]
                    break

        return prob

    def _get_label(self, ...):
        # Need to have a way of knowing which label being classified
        # by OneVsRestClassifier (self.class_label)

bow_clf = Pipeline([('vect', CountVectorizer(stop_words='english', min_df=1, max_df=0.9)), 
                    ('tfidf', TfidfTransformer(use_idf=False)),
                    ('clf', SGDClassifier(loss='log', penalty='l2', alpha=1e-3, n_iter=5)),
                   ])
custom_clf = CustomClassifier(word_to_tag_dict)

ovr_clf = OneVsRestClassifier(VotingClassifier(estimators=[('bow', bow_clf), ('custom', custom_clf)],
                                               voting='soft'))

# Nested parameters use a double underscore: <component>__<parameter>
params = { 'estimator__weights': ([1, 1], [1, 2], [2, 1]) }
gs_clf = GridSearchCV(ovr_clf, params, n_jobs=-1, verbose=1, scoring='precision_samples')

binarizer = MultiLabelBinarizer()

gs_clf.fit(X, binarizer.fit_transform(y))

My intention is to use this manually curated list of words, obtained through several heuristics, to improve on the results of applying bag of words alone. Currently I am struggling to find a way to know which label is being classified while predicting, since OneVsRestClassifier creates a copy of CustomClassifier for each label.
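To make the missing piece concrete, here is a minimal standalone sketch of the keyword rule from predict_proba, written as a plain function that receives the label explicitly. The name keyword_proba, the sample dictionary, and the sample comments are hypothetical and only for illustration. It is written for Python 3, so it uses items() rather than iteritems().

```python
import numpy as np


def keyword_proba(comments, word_to_tag, class_label):
    """Return an (n_comments, 2) array of [P(not label), P(label)].

    A comment gets [0, 1] if it contains any word mapped to class_label,
    and the uninformative [0.5, 0.5] otherwise.
    """
    prob = np.full((len(comments), 2), 0.5)
    for i, comment in enumerate(comments):
        for word, label in word_to_tag.items():
            if label == class_label and word in comment:
                prob[i] = [0.0, 1.0]
                break
    return prob


# Hypothetical curated dictionary and comments
word_to_tag = {"food": "food", "service": "staff"}
comments = ["The food at this restaurant was great.", "Parking was easy."]

food_prob = keyword_proba(comments, word_to_tag, "food")
location_prob = keyword_proba(comments, word_to_tag, "location")
print(food_prob)
print(location_prob)
```

The function is only usable because class_label is passed in as an argument; inside a clone created by OneVsRestClassifier that value is exactly what is not available.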

self.class_label seems undefined to me. I'm not sure what you mean by "which label is being classified"; labels are predicted from the data. - Manoj
Yes, my question is basically how to determine what self.class_label should be. When OneVsRestClassifier fits the data for multilabel classification, it clones the estimator (github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/…) for each label being classified (github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/…). So what I need is a way to determine which label a cloned CustomClassifier corresponds to when it computes predict_proba. - dshah
To describe this further, here is an example. Let's say the comment is "The food at this restaurant was great. The service at this restaurant was also phenomenal." and that I am working with the labels ["food", "staff", "location", "other", ...]. OneVsRestClassifier then creates a clone of the VotingClassifier for each label, which in turn makes a copy of CustomClassifier for each label. But I don't know how to determine which label a specific instance of CustomClassifier corresponds to. - dshah
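The cloning behaviour described in these comments can be observed with a small sketch. RecordingClassifier is a hypothetical stand-in (not part of scikit-learn or of the code above) that only stores the targets it is fitted on, and the tiny X and Y arrays are made up for illustration. The sketch assumes a reasonably recent scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.multiclass import OneVsRestClassifier


class RecordingClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical estimator that records the targets passed to fit."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.seen_y_ = np.asarray(y).copy()
        return self

    def predict_proba(self, X):
        # Uninformative probabilities; only the fit-time behaviour matters here
        return np.full((len(X), 2), 0.5)

    def predict(self, X):
        return np.zeros(len(X), dtype=int)


# Made-up data: 4 samples, 3 labels (say food, staff, location)
X = np.arange(8).reshape(4, 2)
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])

ovr = OneVsRestClassifier(RecordingClassifier()).fit(X, Y)

# One fitted clone per label column; each one only ever saw a 0/1 vector
for column, est in enumerate(ovr.estimators_):
    print(column, est.seen_y_)
```

Each clone is fitted on a single binary column of the indicator matrix, so nothing in the arguments to fit or predict_proba says which label that column stands for. The position of a clone in ovr.estimators_ is the only link back to the label.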

1 Answer