I'm using the NLTK book, Natural Language Processing with Python (2009), and looking at the Naive Bayes classifier, in particular Example 6-3 on p. 228 in my version. The training set is movie reviews.
classifier = nltk.NaiveBayesClassifier.train(train_set)
I peek at the most informative features -
classifier.show_most_informative_features(5)
and I get 'outstanding', 'mulan' and 'wonderfully' among the top-ranking features for a review to be tagged positive.
So, I try the following -
in1 = 'wonderfully mulan'
classifier.classify(document_features(in1.split()))
And I get 'neg'. Now this makes no sense. These were supposed to be the top features.
The document_features function is taken directly from the book:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
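
For reference, here is a minimal, self-contained sketch of what document_features returns for the input above. It assumes a tiny, made-up word_features list of six words (hypothetical; in the book the list is derived from the corpus word frequencies and is much larger), so it only illustrates the shape of the feature dictionary, not the classifier's actual behaviour.

```python
# Hypothetical stand-in for the book's word_features; the real list is
# built from the movie_reviews corpus and is far longer.
word_features = ['plot', 'bad', 'wonderfully', 'boring', 'mulan', 'outstanding']

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        # One boolean feature per known word, True or False.
        features['contains(%s)' % word] = (word in document_words)
    return features

in1 = 'wonderfully mulan'
features = document_features(in1.split())

# Split the feature names by whether the word occurs in the input.
present = sorted(name for name, value in features.items() if value)
absent = sorted(name for name, value in features.items() if not value)

print(present)
print(absent)
```

Note that the dictionary has one entry for every word in word_features, so the features passed to classify() include the words that are missing from the input (set to False) as well as the two that are present.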