Disclaimer: I am pretty new to AI, Python, NLTK and scikit-learn.

I am trying to train a classifier that assigns labels from a fixed set to a collection of documents.

I am using the NLTK wrapper to talk to scikit-learn's OneVsRestClassifier.

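from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

# Each entry pairs a feature dict with a single label.
training_set = [
    [{"car": True, ...}, "Label 1"],
    [{"car": False, ...}, "Label 2"],
    ...
    [{"car": False, ...}, "Label 1"],
]

ovr = SklearnClassifier(OneVsRestClassifier(MultinomialNB()))
ovr.train(training_set)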

This works fine for multi-class classification, where the classifier assigns exactly one label to each document. The accuracy is fine, but I want the classifier to assign zero, one or more labels to each document. How can I do that?

Sadly, I can't just initialise the classifier and tell it to be a multi-label classifier. The documentation says:

This strategy can also be used for multilabel learning, where a classifier is used to predict multiple labels for instance, by fitting on a 2-d matrix in which cell [i, j] is 1 if sample i has label j and 0 otherwise.

This is not really clear to me, as I am not familiar with the terminology. I have the feeling that I have to shape my training set in such a way that the classifier understands I want multi-label classification. If so, how?

I tried to provide the labels as a list, like this:

training_set = [
    [{"car": True, ...}, ["Label 1"]],
    [{"car": False, ...}, ["Label 2"]],
    ...
    [{"car": False, ...}, ["Label 1"]],
]

This did not work as expected and produced the following warning, along with an accuracy of zero:

DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
One-vs-rest accuracy percent: 0.0

2 Answers

2
votes

What the documentation is trying to say is: use a 2-D matrix for the target. So your training set can look like this, with one 0/1 flag per label for every sample:

training_set = [
    [{"car": True, ...}, [is_label_1, is_label_2, is_label_3]],
    [{"car": False, ...}, [is_label_1, is_label_2, is_label_3]],
    ...
    [{"car": False, ...}, [is_label_1, is_label_2, is_label_3]],
]

For a given sample, set the flag of every label that applies. For example, if label 1 and label 3 apply to the first sample, pass [1, 0, 1] as its target.
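To make the [1, 0, 1] encoding concrete, here is a minimal sketch of how such a target matrix could be built instead of typing the flags by hand. It assumes scikit-learn's MultiLabelBinarizer is an acceptable helper here; the label names and the three sample label lists are made up for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label lists, one per sample; an empty list means "no label".
label_lists = [["Label 1", "Label 3"], ["Label 2"], []]

# Passing `classes` fixes the column order of the indicator matrix.
mlb = MultiLabelBinarizer(classes=["Label 1", "Label 2", "Label 3"])
Y = mlb.fit_transform(label_lists)

# Cell [i, j] is 1 if sample i has label j and 0 otherwise.
print(Y)
```

The first row comes out as [1 0 1], the second as [0 1 0], and the unlabelled third sample as [0 0 0], which is the 2-D matrix the documentation describes.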

Hope this makes it clear.

2
votes

I solved this by getting rid of the NLTK-to-scikit-learn adapter and converting my data structure myself into something I can feed to scikit-learn's OneVsRestClassifier. I originally used izip from nltk.compat to split the training set; on Python 3 the built-in zip does the same job, so the version below uses that.

from sklearn.feature_extraction import DictVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

# Turns the feature dicts into a sparse feature matrix.
_vectorizer = DictVectorizer(dtype=float, sparse=True)
# Turns the label lists into a 0/1 indicator matrix, one column per label.
_mlb = MultiLabelBinarizer()

def prepare_scikit_x_and_y(labeled_featuresets):
    # Split [[features, labels], ...] into two parallel sequences.
    X, y = zip(*labeled_featuresets)
    X = _vectorizer.fit_transform(X)
    y = _mlb.fit_transform([set(labels) for labels in y])
    return X, y

def train_classifier(classifier, labeled_featuresets):
    X, y = prepare_scikit_x_and_y(labeled_featuresets)
    classifier.fit(X, y)

training_set = [
    [{"car": True, ...}, ["Label 1"]],
    [{"car": False, ...}, ["Label 2"]],
    ...
    [{"car": False, ...}, ["Label 1"]],
]

# OneVsRestClassifier has fit(), not train(), so go through the helper.
ovr = OneVsRestClassifier(MultinomialNB())
train_classifier(ovr, training_set)
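For completeness, here is a small self-contained sketch of the whole round trip, including the prediction step that turns the 0/1 output back into label names. It is only an illustration under stated assumptions: the toy feature dicts, the label names and the variable names are invented, and it uses MultiLabelBinarizer from sklearn.preprocessing for the label encoding.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up toy data: each feature dict is paired with a list of labels.
toy_set = [
    ({"car": True, "engine": True, "boat": False}, ["vehicles", "mechanics"]),
    ({"car": True, "engine": False, "boat": False}, ["vehicles"]),
    ({"car": False, "engine": True, "boat": True}, ["mechanics", "boats"]),
    ({"car": False, "engine": False, "boat": True}, ["boats"]),
]
features, labels = zip(*toy_set)

vectorizer = DictVectorizer(dtype=float, sparse=True)
mlb = MultiLabelBinarizer()
X = vectorizer.fit_transform(features)  # sparse feature matrix
Y = mlb.fit_transform(labels)           # 0/1 indicator matrix, one column per label

# One binary MultinomialNB is fitted per label column.
ovr = OneVsRestClassifier(MultinomialNB()).fit(X, Y)

# Reuse the fitted vectorizer on unseen documents, then map the
# predicted indicator rows back to tuples of label names.
new_docs = [{"car": True, "engine": True, "boat": False}]
predicted = ovr.predict(vectorizer.transform(new_docs))
result = mlb.inverse_transform(predicted)
print(result)
```

Each entry of the result is a tuple that may hold zero, one or several labels, which is the behaviour the question asks for. Note that transform, not fit_transform, is used on the new documents so that the feature columns line up with the training data.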
