I'm trying to use one of scikit-learn's supervised learning methods to classify pieces of text into one or more categories. The predict function of all the algorithms I tried just returns one match.
For example, I have a piece of text:
"Theaters in New York compared to those in London"
And I have trained the algorithm to pick a place for every text snippet I feed it.
In the above example I would want it to return New York and London, but it only returns New York.
Is it possible to use scikit-learn to return multiple results? Or even return the label with the next highest probability?
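To make the second part of the question concrete, ranking labels by probability could look roughly like the sketch below. It assumes the classifier exposes predict_proba (MultinomialNB does). The third "Paris" class, the toy training texts, and the helper name top_k_labels are made up for illustration and are not part of my real setup.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data, invented for illustration only.
train_texts = [
    "new york nyc big apple",
    "london uk great britain",
    "paris france eiffel tower",
]
train_labels = ["New York", "London", "Paris"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
clf = MultinomialNB(alpha=1.0).fit(X_train, train_labels)


def top_k_labels(texts, k=2):
    """Hypothetical helper: the k most probable labels per text, best first."""
    proba = clf.predict_proba(vectorizer.transform(texts))
    # argsort sorts ascending, so reverse each row and keep the first k columns.
    order = np.argsort(proba, axis=1)[:, ::-1][:, :k]
    return clf.classes_[order].tolist()


ranked = top_k_labels(["theaters in new york compared to those in london"])
print(ranked)
```

This only ranks labels; it does not decide how many of them actually apply, so a cutoff on k or on the probability would still have to be chosen by hand.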
Thanks for your help.
---Update
I tried using OneVsRestClassifier, but I still only get one option back per piece of text. Below is the sample code I am using:
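from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

# One label per training snippet
y_train = ('New York', 'London')
train_set = ("new york nyc big apple", "london uk great britain")
# Fixed vocabulary of unigrams and bigrams, mapped to column indices
vocab = {'new york': 0, 'nyc': 1, 'big apple': 2, 'london': 3, 'uk': 4, 'great britain': 5}
# ngram_range=(1, 2) is the current spelling of WordNGramAnalyzer(min_n=1, max_n=2)
count = CountVectorizer(ngram_range=(1, 2), vocabulary=vocab)
test_set = ('nice day in nyc', 'london town', 'hello welcome to the big apple. enjoy it here and london too')
X_vectorized = count.transform(train_set).toarray()
smatrix2 = count.transform(test_set).toarray()
base_clf = MultinomialNB(alpha=1)
clf = OneVsRestClassifier(base_clf).fit(X_vectorized, y_train)
Y_pred = clf.predict(smatrix2)
print(Y_pred)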
Result: ['New York' 'London' 'London']
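For comparison, one setup that might return several labels is to train on label sets rather than on a single label per snippet. The following is a minimal sketch, assuming a scikit-learn version that provides MultiLabelBinarizer. The third training snippet (tagged with both places), the fit_prior=False setting, and the variable names are assumptions added for illustration, not part of the code above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

# Each training snippet now carries a list of labels; the third one is invented
# so that the classifier sees at least one example with both places.
train_texts = [
    "new york nyc big apple",
    "london uk great britain",
    "new york nyc big apple london uk great britain",
]
train_labels = [["New York"], ["London"], ["New York", "London"]]

# Turn the label lists into a 0/1 indicator matrix, one column per label.
mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(train_labels)

vocab = {'new york': 0, 'nyc': 1, 'big apple': 2,
         'london': 3, 'uk': 4, 'great britain': 5}
count = CountVectorizer(ngram_range=(1, 2), vocabulary=vocab)
X_train = count.transform(train_texts)

# fit_prior=False: with only three snippets, the learned class priors would
# otherwise push every label towards "present".
base_clf = MultinomialNB(alpha=1.0, fit_prior=False)
clf = OneVsRestClassifier(base_clf).fit(X_train, Y_train)

test_texts = [
    "nice day in nyc",
    "london town",
    "hello welcome to the big apple. enjoy it here and london too",
]
Y_pred = clf.predict(count.transform(test_texts))

# Map the indicator rows back to tuples of label names.
predicted = mlb.inverse_transform(Y_pred)
for text, labels in zip(test_texts, predicted):
    print(text, "=>", labels)
```

With an indicator matrix as the target, OneVsRestClassifier fits one yes/no classifier per label, so a snippet can come back with zero, one, or several labels. On this toy data the third test snippet is tagged with both London and New York, but the margins are thin and real data would need far more training examples.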