
I am implementing a Naive Bayes classifier for spam filtering, and I have a doubt about one of the calculations. Please clarify what I should do. Here is my question.

In this method, you have to calculate

P(S|W) = P(W|S) * P(S) / (P(W|S) * P(S) + P(W|H) * P(H))

P(S|W) -> Probability that Message is spam given word W occurs in it.

P(W|S) -> Probability that word W occurs in a spam message.

P(W|H) -> Probability that word W occurs in a Ham message.

So to calculate P(W|S), which of the following is correct:

  1. (Number of times W occurs in spam messages) / (Total number of times W occurs in all messages)

  2. (Number of times W occurs in spam messages) / (Total number of words in all spam messages)

So, to calculate P(W|S), should I do (1) or (2)? (I think it is (2), but I am not sure.)

I am referring to http://en.wikipedia.org/wiki/Bayesian_spam_filtering for the info, by the way.

I have to complete the implementation by this weekend :(


Shouldn't repeated occurrences of the word W increase a message's spam score? In your approach it wouldn't, right?

Let's say we have 100 training messages, of which 50 are spam and 50 are ham, and say the word count of each message is 100.

And let's say the word W occurs 5 times in each spam message and 1 time in each ham message.

So the total number of times W occurs in all spam messages = 5 * 50 = 250.

And the total number of times W occurs in all ham messages = 1 * 50 = 50.

Total occurrences of W in all training messages = 250 + 50 = 300.

So, in this scenario, how do you calculate P(W|S) and P(W|H) ?

Naturally we should expect P(W|S) > P(W|H), right?
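Under the word-frequency (multinomial) reading of option (2), one way to work through this scenario in code (all the numbers here are the hypothetical ones from the example above, not real data):

```python
# Hypothetical scenario: 50 spam and 50 ham messages, each 100 words long;
# W appears 5 times per spam message and once per ham message.
spam_msgs, ham_msgs = 50, 50
words_per_msg = 100

w_in_spam = 5 * spam_msgs   # 250 total occurrences of W in spam
w_in_ham = 1 * ham_msgs     # 50 total occurrences of W in ham

# Word-frequency estimate:
# occurrences of W in a class / total words in that class
p_w_given_s = w_in_spam / (spam_msgs * words_per_msg)  # 250 / 5000 = 0.05
p_w_given_h = w_in_ham / (ham_msgs * words_per_msg)    # 50 / 5000 = 0.01

assert p_w_given_s > p_w_given_h  # W is indeed more indicative of spam
print(p_w_given_s, p_w_given_h)   # 0.05 0.01
```

With these numbers P(W|S) = 0.05 and P(W|H) = 0.01, so the expected inequality P(W|S) > P(W|H) does hold.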

Is there any PHP implementation of Naive Bayes that is used to find spam? – Hamza Zafeer

3 Answers

5 votes

P(W|S) = (Number of spam messages containing W) / (Number of all spam messages)
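A minimal sketch of that estimate (the tiny corpus and its spam labels here are made up purely for illustration):

```python
# Fraction of spam messages that contain the word W at least once.
spam_messages = [            # hypothetical messages, all labeled spam
    "buy viagra now",
    "cheap viagra offer here",
    "you won a prize",
]

def p_word_given_spam(word, messages):
    # (Number of spam messages containing W) / (Number of all spam messages)
    containing = sum(1 for m in messages if word in m.split())
    return containing / len(messages)

print(p_word_given_spam("viagra", spam_messages))  # 2/3
```

Note that a message counts once no matter how many times the word appears in it; that is the "message contains W" event model.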

2 votes

Though this is quite an old question, none of the answers is complete, so it's worth correcting them.

Naive Bayes is not a single algorithm but a family of algorithms, all based on the same Bayes rule:

P(C|x) = P(x|C) * P(C) / P(x)

where C is a class (ham or spam in this example) and x is a vector of attributes (words, in the simplest case). P(C) is just the proportion of messages of class C in the whole dataset. P(x) is the probability of a message with the attributes described by vector x occurring, but since this term is the same for all classes we can omit it for the moment. This question, however, is about P(x|C): how should one calculate it given the attribute vector x of the current message?

Actually, the answer depends on the concrete type of NB algorithm. There are several of them, including multivariate Bernoulli NB, multivariate Gaussian NB, multinomial NB with numeric or boolean attributes, and others. For the details of calculating P(x|C) for each of them, and for a comparison of NB classifiers on the task of spam filtering, see this paper.
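To make the difference concrete, here is a sketch (with a made-up two-message corpus) of how the multivariate Bernoulli and multinomial variants estimate the per-word factors of P(x|C) differently:

```python
from collections import Counter

# Hypothetical tokenized spam messages
spam = [["win", "cash", "now"], ["cash", "cash", "prize"]]

def bernoulli_estimate(word, msgs):
    # Multivariate Bernoulli: P(word present | class) =
    # fraction of the class's messages containing the word at least once
    return sum(word in m for m in msgs) / len(msgs)

def multinomial_estimate(word, msgs):
    # Multinomial: P(word | class) = word's share of all tokens in the class
    counts = Counter(w for m in msgs for w in m)
    return counts[word] / sum(counts.values())

print(bernoulli_estimate("cash", spam))    # 2/2 = 1.0
print(multinomial_estimate("cash", spam))  # 3/6 = 0.5
```

The Bernoulli model ignores repetitions ("cash" appearing three times counts the same as twice), while the multinomial model is sensitive to them, which is exactly the distinction raised in the question's follow-up.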

1 vote

In this Bayesian formula, W is your "feature", i.e., the thing you observe.

You must first carefully define what W is. Often you have many alternatives.

Let's say that, in a first approach, you define W as the event "the message contains the word Viagra". (That is to say, W has two possible values: 0 = "the message does not contain the word", 1 = "the message contains at least one occurrence of it".)

In that scenario, you're right: P(W|S) is "the probability that the word W appears (at least once) in a spam message". And to estimate it (better than "calculate" it), you count, as the other answer says: (Number of spam messages containing at least one occurrence of the word) / (Number of all spam messages).

An alternative approach would be to define W = "number of occurrences of the word Viagra in a message". In this case, we should estimate P(W|S) for each value of W: P(W=0|S), P(W=1|S), P(W=2|S), ... More complicated, and more samples are needed, but (hopefully) better performance.
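A sketch of that count-valued variant, estimating P(W=k|S) empirically (the per-message counts below are hypothetical):

```python
from collections import Counter

# Hypothetical: how many times the word occurs in each of 8 spam messages
counts_in_spam = [0, 1, 2, 0, 1, 0, 3, 1]

def p_count_given_spam(k, counts):
    # Empirical estimate of P(W = k | spam):
    # fraction of spam messages in which the word occurs exactly k times
    return Counter(counts)[k] / len(counts)

print(p_count_given_spam(0, counts_in_spam))  # 3/8 = 0.375
print(p_count_given_spam(1, counts_in_spam))  # 3/8 = 0.375
```

As the answer notes, each value of k now needs its own estimate, so reliable probabilities require many more training samples than the binary present/absent model.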