2
votes

I'm familiar with machine learning and Naive Bayes, but I am having some trouble understanding how to implement it for document classification where my feature vector is a bag of words.

In particular, how do you compute the class-conditional feature likelihood Pr(word | class)? In many texts I see the following expression:

Pr(f_i | c) = count(f_i, c) / count(c)

How is the right-hand side implemented? Is it the count of documents of class c in which the feature f occurs divided by the count of documents of class c?

For example, suppose you have 10 documents, where 7 are class C1 and 3 are class C2. The word "amazing" occurs in some of them:

C1: ...
C1: ... amazing ...
C1: ...
C1: ... amazing ...
C1: ... amazing ...
C1: ...
C1: ...
C2: ...
C2: ... amazing ...
C2: ...

It looks like:

  • count(amazing, C1) = 3
  • count(amazing, C2) = 1
  • count(C1) = 7
  • count(C2) = 3

Would Pr(amazing|C1) = 3/7 and Pr(amazing|C2) = 1/3?
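
For concreteness, here is a minimal plain-Python sketch of that document-count interpretation (the "..." documents are hypothetical stand-ins for the example above):

    # Document-count estimate: Pr(word | class) =
    # (documents of the class containing the word) / (documents of the class).
    docs = [
        ("C1", "..."), ("C1", "... amazing ..."), ("C1", "..."),
        ("C1", "... amazing ..."), ("C1", "... amazing ..."),
        ("C1", "..."), ("C1", "..."),
        ("C2", "..."), ("C2", "... amazing ..."), ("C2", "..."),
    ]

    def pr_word_given_class(word, cls):
        # count(word, cls): documents of class cls containing the word at least once
        n_word = sum(1 for c, text in docs if c == cls and word in text.split())
        # count(cls): all documents of class cls
        n_cls = sum(1 for c, _ in docs if c == cls)
        return n_word / n_cls

    print(pr_word_given_class("amazing", "C1"))  # 3/7 ≈ 0.429
    print(pr_word_given_class("amazing", "C2"))  # 1/3 ≈ 0.333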


Edit 5/7/2015

I came across a discussion of Naive Bayes for text classification in the book "Introduction to Information Retrieval", Chapter 13 (PDF). It gives a different formulation of the class-conditional feature probability:

P(word | class) = count(word, class) / count(class)

So here it looks like count(word, class) is the total number of occurrences of the word across documents of the class, rather than the number of documents of the class that contain it.

Likewise, count(class) is the total number of word tokens in documents of the class, not the number of documents in the class.
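
A small sketch of this token-count version, with made-up token lists purely for illustration, would look like:

    # Token-count (multinomial) estimate: Pr(word | class) =
    # (occurrences of the word in class documents) / (total word tokens in class documents).
    docs = [
        ("C1", ["amazing", "movie", "truly", "amazing"]),
        ("C1", ["boring", "plot"]),
        ("C2", ["amazing", "soundtrack"]),
    ]

    def pr_word_given_class_multinomial(word, cls):
        n_word = sum(tokens.count(word) for c, tokens in docs if c == cls)  # count(word, class)
        n_tokens = sum(len(tokens) for c, tokens in docs if c == cls)       # count(class)
        return n_word / n_tokens

    print(pr_word_given_class_multinomial("amazing", "C1"))  # 2/6 ≈ 0.333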

Which formulation of P(feature|class) is preferred?

1
The answer to both of your questions is yes. – Artem Sobolev

1 Answer

3
votes

Yes, your interpretation and example are correct. count(f_i, c_i) counts all events in which f_i and c_i occur together, i.e. all documents of class c_i that contain feature f_i (presence of a single word in this case, though in general the feature could be the joint presence of two or more words, or anything else).

In fact, the cited equation is the maximum-likelihood estimate; see the paper The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm for the full derivation and proof.

Upd: as the same chapter states (see section 13.3), the first estimate is based on the Bernoulli model, while the second corresponds to the multinomial model. The Bernoulli model is better suited to short documents and is "particularly sensitive to noise features"; see the book again, or the paper A comparison of event models for Naive Bayes text classification (also referenced in the book, section 13.7).
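
If you want to try both event models in practice, one option (an illustrative sketch only; the toy corpus and labels below are made up, not from the book) is scikit-learn's MultinomialNB and BernoulliNB:

    # Rough comparison of the two event models with scikit-learn.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import BernoulliNB, MultinomialNB

    corpus = ["amazing film amazing cast", "dull plot", "amazing soundtrack", "boring and slow"]
    labels = ["pos", "neg", "pos", "neg"]

    # Multinomial model: features are word counts per document
    counts = CountVectorizer()
    multinomial = MultinomialNB().fit(counts.fit_transform(corpus), labels)

    # Bernoulli model: features are binary word presence per document
    binary = CountVectorizer(binary=True)
    bernoulli = BernoulliNB().fit(binary.fit_transform(corpus), labels)

    test = ["amazing but slow"]
    print(multinomial.predict(counts.transform(test)))
    print(bernoulli.predict(binary.transform(test)))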