Python nltk Naive Bayes doesn't seem to work

Question

I'm using the NLTK book, Natural Language Processing with Python (2009), and looking at the Naive Bayes classifier, in particular Example 6-3 on page 228 in my version. The training set is movie reviews.

classifier = nltk.NaiveBayesClassifier.train(train_set)

I peek at the most informative features -

classifier.show_most_informative_features(5)

and I get 'outstanding', 'mulan' and 'wonderfully' among the top-ranked features for a review to be tagged 'positive'.

So, I try the following -

in1 = 'wonderfully mulan'
classifier.classify(document_features(in1.split()))

And I get 'neg'. Now this makes no sense. These were supposed to be the top features.

the document_features function is taken directly from the book -

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

arturomp · Accepted Answer · 2013-11-28T08:00:33

Note that the feature vector in that example is made up of the "2000 most frequent words in the overall corpus." So, assuming the corpus is comprehensive, a regular review will probably contain quite a few of those words. (In real-world reviews of the latest Jackass movie and Dallas Buyers Club, I get 26/2000 and 28/2000 features respectively.)

If you feed it a review containing only "wonderfully mulan", the resulting feature vector only has 2/2000 features set to True. Basically, you're giving it a pseudoreview with little to no information that it knows about or that it can do anything with. For that vector, it's hard to tell what it will predict.
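
To make the sparsity concrete, here is a minimal sketch that counts how many features end up True for that input. The vocabulary below is a made-up stand-in for the book's 2000-word word_features list (I'm assuming it contains both words), and document_features takes it as an explicit argument so the snippet runs on its own:

```python
def document_features(document, word_features):
    """Same idea as the book's function, with the vocabulary passed in explicitly."""
    document_words = set(document)
    return {'contains(%s)' % w: (w in document_words) for w in word_features}

# Hypothetical stand-in for the 2000-word vocabulary; the real list comes
# from the movie review corpus.
word_features = ['wonderfully', 'mulan'] + ['word%d' % i for i in range(1998)]

features = document_features('wonderfully mulan'.split(), word_features)
active = sum(features.values())  # True counts as 1, False as 0
print('%d/%d features are True' % (active, len(features)))
# prints: 2/2000 features are True
```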

The feature vector should be healthily populated with features leaning in a positive direction for it to output pos. Maybe look at the most informative, say, 500 features, see which ones lean positive, and then create a string with only those? That might get you closer to pos, but not necessarily.
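
Part of why two strong words aren't enough: a Naive Bayes model over boolean features also scores every feature that is False, so the 1998 absent words count as evidence too. Here's a toy sketch of that arithmetic in plain Python. It is not NLTK's implementation, and every probability in it is invented for illustration (equal priors, two strongly positive words, and 1998 filler words that are slightly more common in positive reviews):

```python
import math

def log_odds_pos(features, p_true):
    """log P(pos | features) - log P(neg | features), assuming equal priors.

    p_true maps a feature name to (P(True | pos), P(True | neg)).
    """
    total = 0.0
    for name, value in features.items():
        p_pos, p_neg = p_true[name]
        if value:
            total += math.log(p_pos / p_neg)
        else:
            # An absent word still contributes: P(False | label) = 1 - P(True | label)
            total += math.log((1 - p_pos) / (1 - p_neg))
    return total

# Invented numbers, for illustration only.
strong = ['wonderfully', 'mulan']
filler = ['word%d' % i for i in range(1998)]
p_true = {w: (0.05, 0.01) for w in strong}
p_true.update({w: (0.12, 0.10) for w in filler})

def make_features(words):
    """Boolean feature dict over the whole toy vocabulary."""
    present = set(words)
    return {w: (w in present) for w in p_true}

sparse = make_features(strong)                 # 2 of 2000 features True
fuller = make_features(strong + filler[:300])  # 302 of 2000 features True

for name, feats in [('sparse', sparse), ('fuller', fuller)]:
    score = log_odds_pos(feats, p_true)
    print(name, 'pos' if score > 0 else 'neg')
```

With these made-up numbers the sparse vector comes out neg and the fuller one pos. The real classifier's estimates will differ, but as far as I can tell it scores False features in a similar way.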

Some feature vectors in the train_set classify as pos. (Anecdotally, I found one of them to have 417 features equal to True.) However, in my tests, no documents from either the neg or the pos training set partition classified to pos. So while you may be right that the classifier doesn't seem to be doing a great job (at least the pos training examples should classify to pos), the example you're giving it is not a great measure of that.
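
If you want to repeat that kind of check, a small helper like the one below tallies gold label against predicted label. It only assumes the classifier has a classify(featureset) method, as nltk.NaiveBayesClassifier does; the AlwaysNeg stub and the toy data are hypothetical stand-ins so the snippet runs without the corpus:

```python
from collections import Counter

def prediction_counts(classifier, labeled_featuresets):
    """Count (gold_label, predicted_label) pairs over labeled feature sets."""
    counts = Counter()
    for features, gold in labeled_featuresets:
        counts[(gold, classifier.classify(features))] += 1
    return counts

class AlwaysNeg:
    """Hypothetical stub that mimics a classifier stuck on 'neg'."""
    def classify(self, features):
        return 'neg'

# Toy labeled data in the same (featureset, label) shape as train_set.
toy_set = [({'contains(good)': True}, 'pos'),
           ({'contains(good)': False}, 'neg'),
           ({'contains(good)': True}, 'pos')]

counts = prediction_counts(AlwaysNeg(), toy_set)
print(counts[('pos', 'pos')], counts[('pos', 'neg')], counts[('neg', 'neg')])
# prints: 0 2 1
```

With the real objects you would call prediction_counts(classifier, train_set); a ('pos', 'pos') count of zero would match the behaviour described above.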
