Disclaimer: I am pretty new to AI, Python, NLTK and scikit-learn.
I am trying to train a classifier that assigns labels from a fixed set to a collection of documents.
I am using the NLTK wrapper to talk to scikit-learn's OneVsRestClassifier.
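from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

# Each item pairs a feature dict with a single label.
training_set = [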
[{"car": True, ...}, "Label 1"],
[{"car": False, ...}, "Label 2"],
...
[{"car": False, ...}, "Label 1"],
]
ovr = SklearnClassifier(OneVsRestClassifier(MultinomialNB()))
ovr.train(training_set)
This works fine for multi-class classification, where the classifier assigns exactly one label to each document. The accuracy is fine, but I want the classifier to assign zero, one or more labels to each document. How can I do that?
Sadly, I can't just initialise the classifier and tell it to be a multi-label classifier. The documentation says:
This strategy can also be used for multilabel learning, where a classifier is used to predict multiple labels for instance, by fitting on a 2-d matrix in which cell [i, j] is 1 if sample i has label j and 0 otherwise.
This is not really clear to me, as I am not familiar with this terminology. I have the feeling that I have to shape my training set in such a way that the classifier understands that I want multi-label classification. If so, how?
I tried providing the labels as a list, like this:
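For reference, here is a minimal sketch of what I understand the "2-d matrix" in the documentation to mean. It assumes scikit-learn's MultiLabelBinarizer is an acceptable way to build that matrix; the label sets are made up for illustration and are not my real data.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label sets, one per document.
# A document may carry zero, one or several labels.
label_sets = [["Label 1"], ["Label 2"], ["Label 1", "Label 2"], []]

mlb = MultiLabelBinarizer()
# Row i is document i, column j is label j; a cell is 1 if the
# document has that label and 0 otherwise.
Y = mlb.fit_transform(label_sets)

print(mlb.classes_)  # column order of the matrix
print(Y)
```

Each row then describes one document and each column one label, which seems to be the shape the quoted passage is asking for.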
training_set = [
[{"car": True, ...}, ["Label 1"]],
[{"car": False, ...}, ["Label 2"]],
...
[{"car": False, ...}, ["Label 1"]],
]
This did not work as expected. It produced the following warning and an accuracy of 0:
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
One-vs-rest accuracy percent: 0.0
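The warning suggests the labels end up being treated as a single column rather than as sets of labels. One possible direction, sketched below, is to skip the NLTK wrapper and hand scikit-learn a feature matrix and a label indicator matrix directly. This is only a sketch under assumptions: the toy feature dicts and labels are invented, and using DictVectorizer and MultiLabelBinarizer is my own choice rather than something the NLTK or scikit-learn documentation prescribes for this case.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: (feature dict, list of labels) pairs.
training_set = [
    ({"car": True, "engine": True}, ["Label 1"]),
    ({"car": False, "recipe": True}, ["Label 2"]),
    ({"car": True, "recipe": True}, ["Label 1", "Label 2"]),
    ({"car": False, "engine": True}, ["Label 1"]),
]
features, label_sets = zip(*training_set)

# Turn the feature dicts into a numeric matrix (True/False become 1/0).
vectorizer = DictVectorizer()
X = vectorizer.fit_transform(features)

# Turn the label lists into a 0/1 indicator matrix, one column per label.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(label_sets)

# One binary classifier is fitted per label column.
ovr = OneVsRestClassifier(MultinomialNB())
ovr.fit(X, Y)

# Predict for a new, made-up document and map the 0/1 row back to label names.
new_doc = vectorizer.transform([{"car": True, "recipe": True}])
predicted = mlb.inverse_transform(ovr.predict(new_doc))
print(predicted)
```

With this shape the prediction for each document is a tuple of labels, which may be empty, so zero, one or several labels are all possible outcomes.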