I have taken a look at, and tried out, scikit-learn's tutorial on its multinomial Naive Bayes classifier.

I want to use it to classify text documents, and the catch with NB is that it treats P(document | label) as a product of all its independent features (words). Right now, I need to try out a trigram (3-gram) classifier, where P(document | label) = P(wordX | wordX-1, wordX-2, label) * P(wordX-1 | wordX-2, wordX-3, label) * ...

Does scikit-learn support anything with which I can implement this language model and extend the NB classifier to perform classification based on it?

1 Answer


CountVectorizer will extract trigrams for you (using ngram_range=(3, 3)); the text feature extraction documentation introduces this. Then just use MultinomialNB exactly as before, with the transformed feature matrix.
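A minimal sketch of that setup; the toy documents, labels, and test sentence here are made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus, purely illustrative.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks fell sharply in early trading",
    "the market rallied after the opening bell",
]
labels = ["pets", "pets", "finance", "finance"]

# ngram_range=(3, 3) makes CountVectorizer emit trigram counts only;
# MultinomialNB then treats each trigram as an independent feature.
clf = make_pipeline(CountVectorizer(ngram_range=(3, 3)), MultinomialNB())
clf.fit(docs, labels)

print(clf.predict(["the cat sat on the log"]))  # shares trigrams with "pets"
```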

Note that this is actually modeling:

P(document | label) = P(wordX, wordX-1, wordX-2 | label) * P(wordX-1, wordX-2, wordX-3 | label) * ...

How different is that? Well, that first term can be written as

P(wordX, wordX-1, wordX-2 | label) = P(wordX | wordX-1, wordX-2, label) * P(wordX-1, wordX-2 | label)

Of course, all the other terms can be written that way too, so you end up with (dropping the "word" prefixes and the conditioning on the label for brevity):

P(X | X-1, X-2) P(X-1 | X-2, X-3) ... P(3 | 2, 1) P(X-1, X-2) P(X-2, X-3) ... P(2, 1)

Now, P(X-1, X-2) can be written as P(X-1 | X-2) P(X-2). So if we do that for all those terms, we have

P(X | X-1, X-2) P(X-1 | X-2, X-3) ... P(3 | 2, 1) P(X-1 | X-2) P(X-2 | X-3) ... P(2 | 1) P(X-2) P(X-3) ... P(1)
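
As a sanity check, here is one concrete instance: for a four-word document (X = 4), still dropping the label, the product above works out to

P(4, 3, 2) P(3, 2, 1) = P(4 | 3, 2) P(3 | 2, 1) P(3 | 2) P(2 | 1) P(2) P(1)

i.e. one trigram conditional per trigram term, plus the leftover bigram and unigram pieces.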

So this is actually like using trigrams, bigrams, and unigrams (though not estimating the bigram/unigram terms directly).
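
If you would rather have those bigram and unigram terms estimated directly as features, CountVectorizer can emit all three orders at once. A minimal sketch, again with made-up toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus, purely illustrative.
docs = ["the cat sat on the mat", "stocks fell sharply in early trading"]
labels = ["pets", "finance"]

# ngram_range=(1, 3) produces unigram, bigram, and trigram counts in one
# feature matrix, so the lower-order terms become explicit NB features
# instead of being implied by the trigram decomposition above.
clf = make_pipeline(CountVectorizer(ngram_range=(1, 3)), MultinomialNB())
clf.fit(docs, labels)
```

Note this is a different model from the decomposition above (it multiplies independent per-n-gram likelihoods rather than chaining conditionals), but it is a simple off-the-shelf way to mix n-gram orders in scikit-learn.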