I'm familiar with machine learning and Naive Bayes, but I am having some trouble understanding how to implement it for document classification where my feature vector is a bag of words.
In particular, how do you compute the class-conditional feature likelihood Pr(word | class)? In many texts I see the following formulation:

Pr(word | class) = count(word, class) / count(class)

How is the right-hand side implemented? Is it the count of documents of class c in which the feature f occurs, divided by the count of documents of class c?
For example, suppose you have 10 documents, where 7 are class C1 and 3 are class C2. The word "amazing" occurs in some of them:
C1: ...
C1: ... amazing ...
C1: ...
C1: ... amazing ...
C1: ... amazing ...
C1: ...
C1: ...
C2: ...
C2: ... amazing ...
C2: ...
It looks like:
- count(amazing, C1) = 3
- count(amazing, C2) = 1
- count(C1) = 7
- count(C2) = 3
Would Pr(amazing|C1) = 3/7 and Pr(amazing|C2) = 1/3?
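To make sure I have the counting right, here is a minimal Python sketch of this document-count interpretation. The document texts are invented placeholders (the actual contents are elided above); all that matters is which documents contain "amazing":

```python
# Toy stand-in for the 10 documents above; the texts are made up,
# chosen only so that "amazing" appears in the same documents as in
# the example (3 of the 7 C1 documents, 1 of the 3 C2 documents).
docs = [
    ("C1", "the plot was thin"),
    ("C1", "an amazing film"),
    ("C1", "nothing special"),
    ("C1", "amazing acting throughout"),
    ("C1", "a truly amazing score"),
    ("C1", "it was fine"),
    ("C1", "a bit slow"),
    ("C2", "not my thing"),
    ("C2", "amazing in places"),
    ("C2", "forgettable overall"),
]

def doc_count_likelihood(word, cls, docs):
    """Pr(word | class) as the fraction of class documents containing the word."""
    class_docs = [text.split() for label, text in docs if label == cls]
    containing = sum(1 for tokens in class_docs if word in tokens)
    return containing / len(class_docs)

print(doc_count_likelihood("amazing", "C1", docs))  # 3/7 ≈ 0.429
print(doc_count_likelihood("amazing", "C2", docs))  # 1/3 ≈ 0.333
```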
Edit 5/7/2015
I came across a discussion of Naive Bayes for text classification in Chapter 13 of the book "Introduction to Information Retrieval" (PDF). It gives a different formulation for the class-conditional feature probability:

Pr(word | class) = count(word, class) / count(class)
So, here it looks like count(word, class) is the total number of occurrences of the word across all documents in the class, rather than the number of documents in the class that contain the word.
Likewise, count(class) is the total number of word tokens in documents of the class, not the number of documents in the class.
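If I'm reading the chapter correctly, this token-count formulation would be computed like the sketch below (reusing the placeholder `docs` list from the earlier sketch, and leaving out the add-one smoothing that I believe the chapter applies on top):

```python
def token_count_likelihood(word, cls, docs):
    """Pr(word | class) as the fraction of all word tokens in class documents."""
    tokens = [tok for label, text in docs if label == cls
              for tok in text.split()]
    return tokens.count(word) / len(tokens)

# Reusing the placeholder `docs` list from the sketch above:
print(token_count_likelihood("amazing", "C1", docs))  # 3 occurrences / 22 tokens
print(token_count_likelihood("amazing", "C2", docs))  # 1 occurrence / 8 tokens
```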
Which formulation of Pr(feature | class) is preferred?