I am implementing a Naive Bayes classifier for spam filtering, and I have a doubt about one of the calculations. Please clarify what I should do. Here is my question.
In this method, you have to calculate:
P(S|W) -> Probability that a message is spam, given that word W occurs in it.
P(W|S) -> Probability that word W occurs in a spam message.
P(W|H) -> Probability that word W occurs in a ham message.
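For reference, these three quantities are tied together by Bayes' theorem. Here P(S) and P(H) are the prior probabilities that any message is spam or ham; those two symbols are not defined above and are included only to make the formula complete:

```latex
P(S \mid W) = \frac{P(W \mid S)\, P(S)}{P(W \mid S)\, P(S) + P(W \mid H)\, P(H)}
```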
So, to calculate P(W|S), which of the following is correct?
(1) (Number of times W occurs in spam) / (Total number of times W occurs in all the messages)
(2) (Number of times W occurs in spam) / (Total number of words in the spam messages)
Should I use (1) or (2)? (I think it is (2), but I am not sure.)
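To make the two candidates concrete, here is a minimal sketch that computes both from tokenized messages. The function name `candidate_estimates` and the toy corpus are made up for illustration, and the code only restates the two formulas above; it does not settle which one is the right estimate.

```python
from collections import Counter


def candidate_estimates(word, spam_messages, ham_messages):
    """Return (option_1, option_2) for the two candidate formulas.

    Each message is a list of tokens. Both values are the candidates
    from the question, not an endorsed answer.
    """
    spam_counts = Counter(tok for msg in spam_messages for tok in msg)
    ham_counts = Counter(tok for msg in ham_messages for tok in msg)

    w_in_spam = spam_counts[word]
    w_everywhere = w_in_spam + ham_counts[word]
    spam_total_words = sum(spam_counts.values())

    # Option (1): share of all occurrences of the word that fall in spam.
    option_1 = w_in_spam / w_everywhere if w_everywhere else 0.0
    # Option (2): share of all spam tokens that are this word.
    option_2 = w_in_spam / spam_total_words if spam_total_words else 0.0
    return option_1, option_2


# Hypothetical toy corpus, invented for illustration only.
spam = [["buy", "cheap", "pills", "buy"], ["cheap", "offer"]]
ham = [["meeting", "at", "noon"], ["buy", "milk"]]

opt1, opt2 = candidate_estimates("buy", spam, ham)
print(opt1, opt2)
```

Note that the two options differ only in the denominator: (1) normalizes over every occurrence of W in the whole corpus, while (2) normalizes over all the words seen in spam.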
I am referring to http://en.wikipedia.org/wiki/Bayesian_spam_filtering for the info, by the way.
I have to complete the implementation by this weekend :(
Shouldn't repeated occurrences of word W increase a message's spam score? In your approach they wouldn't, right?
Let's say we have 100 training messages, of which 50 are spam and 50 are ham, and say each message contains 100 words.
And let's say word W occurs 5 times in each spam message and 1 time in each ham message.
So the total number of times W occurs in all the spam messages = 5*50 = 250 times.
And the total number of times W occurs in all the ham messages = 1*50 = 50 times.
Total occurrences of W in all of the training messages = (250+50) = 300 times.
So, in this scenario, how do you calculate P(W|S) and P(W|H)?
Naturally, we should expect P(W|S) > P(W|H), right?
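For concreteness, this is how the numbers from the scenario work out under the two candidate formulas from the question. It is only a sketch of the arithmetic; the variable names are made up, and the per-class word totals assume every message has exactly 100 words, as stated above.

```python
# Numbers taken from the scenario above.
n_spam, n_ham = 50, 50
words_per_message = 100
w_per_spam_msg, w_per_ham_msg = 5, 1

w_in_spam = w_per_spam_msg * n_spam        # 250 occurrences of W in spam
w_in_ham = w_per_ham_msg * n_ham           # 50 occurrences of W in ham
w_total = w_in_spam + w_in_ham             # 300 occurrences overall
spam_words = n_spam * words_per_message    # 5000 words across all spam
ham_words = n_ham * words_per_message      # 5000 words across all ham

# Option (1): divide by all occurrences of W in the corpus.
p_w_s_1 = w_in_spam / w_total   # 250/300
p_w_h_1 = w_in_ham / w_total    # 50/300

# Option (2): divide by the total number of words in that class.
p_w_s_2 = w_in_spam / spam_words   # 250/5000
p_w_h_2 = w_in_ham / ham_words     # 50/5000

print(p_w_s_1, p_w_h_1)
print(p_w_s_2, p_w_h_2)
```

Both options give a larger value for spam than for ham here. The difference is in the normalization: under (1) the two values sum to 1 across the classes, while under (2) each value is a fraction of the words within its own class.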