
I am implementing a Naive Bayes classifier for spam filtering, and I have a doubt about one of the calculations. Please clarify what I should do. Here is my question.

In this method, you have to calculate

P(S|W) = P(W|S) * P(S) / (P(W|S) * P(S) + P(W|H) * P(H))

P(S|W) -> Probability that Message is spam given word W occurs in it.

P(W|S) -> Probability that word W occurs in a spam message.

P(W|H) -> Probability that word W occurs in a Ham message.

So to calculate P(W|S), which of the following is correct:

  1. (Number of times W occurs in spam messages) / (Total number of times W occurs in all messages)

  2. (Number of times W occurs in spam messages) / (Total number of words in all spam messages)

So, to calculate P(W|S), should I do (1) or (2)? (I think it is (2), but I am not sure.)

I am referring to http://en.wikipedia.org/wiki/Bayesian_spam_filtering for the info, by the way.

I have to complete the implementation by this weekend :(


Shouldn't repeated occurrences of the word W increase a message's spam score? In your approach it wouldn't, right?

Let's say we have 100 training messages, of which 50 are spam and 50 are ham, and say the word count of each message is 100.

And let's say the word W occurs 5 times in each spam message and 1 time in each ham message.

So the total number of times W occurs in all spam messages = 5 * 50 = 250.

And the total number of times W occurs in all ham messages = 1 * 50 = 50.

Total occurrences of W in all training messages = 250 + 50 = 300.

So, in this scenario, how do you calculate P(W|S) and P(W|H) ?

Naturally we should expect P(W|S) > P(W|H), right?
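Under the word-frequency (multinomial) reading of option (2), one way to work through this scenario in code (all the numbers here are the hypothetical ones from the example above, not real data):

```python
# Hypothetical scenario: 50 spam and 50 ham messages, each 100 words long;
# W appears 5 times per spam message and once per ham message.
spam_msgs, ham_msgs = 50, 50
words_per_msg = 100

w_in_spam = 5 * spam_msgs   # 250 total occurrences of W in spam
w_in_ham = 1 * ham_msgs     # 50 total occurrences of W in ham

# Word-frequency estimate:
# occurrences of W in a class / total words in that class
p_w_given_s = w_in_spam / (spam_msgs * words_per_msg)  # 250 / 5000 = 0.05
p_w_given_h = w_in_ham / (ham_msgs * words_per_msg)    # 50 / 5000 = 0.01

assert p_w_given_s > p_w_given_h  # W is indeed more indicative of spam
print(p_w_given_s, p_w_given_h)   # 0.05 0.01
```

With these numbers P(W|S) = 0.05 and P(W|H) = 0.01, so the expected inequality P(W|S) > P(W|H) does hold.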

Is there any PHP implementation of Naive Bayes that is used to find spam? – Hamza Zafeer

3 Answers

5 votes

P(W|S) = (Number of spam messages containing W) / (Number of all spam messages)
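A minimal sketch of that estimate (the tiny corpus and its spam labels here are made up purely for illustration):

```python
# Fraction of spam messages that contain the word W at least once.
spam_messages = [            # hypothetical messages, all labeled spam
    "buy viagra now",
    "cheap viagra offer here",
    "you won a prize",
]

def p_word_given_spam(word, messages):
    # (Number of spam messages containing W) / (Number of all spam messages)
    containing = sum(1 for m in messages if word in m.split())
    return containing / len(messages)

print(p_word_given_spam("viagra", spam_messages))  # 2/3
```

Note that a message counts once no matter how many times the word appears in it; that is the "message contains W" event model.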

2 votes

Though this is quite an old question, none of the answers is complete, so it's worth correcting them.

Naive Bayes is not a single algorithm but a family of algorithms, all based on the same Bayes rule:

P(C|x) = P(x|C) * P(C) / P(x)

where C is a class (ham or spam in this example) and x is a vector of attributes (words, in the simplest case). P(C) is just the proportion of messages of class C in the whole dataset. P(x) is the probability of a message with the attributes described by vector x occurring, but since this term is the same for all classes we can omit it for the moment. This question, however, is about P(x|C): how should one calculate it given the attribute vector x of the current message?

Actually, the answer depends on the concrete type of NB algorithm. There are several of them, including multivariate Bernoulli NB, multivariate Gaussian NB, multinomial NB with numeric or boolean attributes, and others. For the details of calculating P(x|C) for each of them, and for a comparison of NB classifiers on the task of spam filtering, see this paper.
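To make the difference concrete, here is a sketch (with a made-up two-message corpus) of how the multivariate Bernoulli and multinomial variants estimate the per-word factors of P(x|C) differently:

```python
from collections import Counter

# Hypothetical tokenized spam messages
spam = [["win", "cash", "now"], ["cash", "cash", "prize"]]

def bernoulli_estimate(word, msgs):
    # Multivariate Bernoulli: P(word present | class) =
    # fraction of the class's messages containing the word at least once
    return sum(word in m for m in msgs) / len(msgs)

def multinomial_estimate(word, msgs):
    # Multinomial: P(word | class) = word's share of all tokens in the class
    counts = Counter(w for m in msgs for w in m)
    return counts[word] / sum(counts.values())

print(bernoulli_estimate("cash", spam))    # 2/2 = 1.0
print(multinomial_estimate("cash", spam))  # 3/6 = 0.5
```

The Bernoulli model ignores repetitions ("cash" appearing three times counts the same as twice), while the multinomial model is sensitive to them, which is exactly the distinction raised in the question's follow-up.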

1 vote

In this Bayesian formula, W is your "feature", i.e., the thing you observe.

You must first carefully define what W is. Often you have many alternatives.

Let's say that, in a first approach, you define W as the event "the message contains the word Viagra". (That is to say, W has two possible values: 0 = "the message does not contain the word", 1 = "the message contains at least one occurrence of it".)

In that scenario, you're right: P(W|S) is "the probability that the word W appears (at least once) in a spam message". And to estimate it (better than "calculate" it), you count, as the other answer says: (Number of spam messages containing at least one occurrence of the word) / (Number of all spam messages).

An alternative approach would be to define W = "number of occurrences of the word Viagra in a message". In this case, we should estimate P(W|S) for each value of W: P(W=0|S), P(W=1|S), P(W=2|S), ... More complicated, and more samples are needed, but (hopefully) better performance.
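A sketch of that count-valued variant, estimating P(W=k|S) empirically (the per-message counts below are hypothetical):

```python
from collections import Counter

# Hypothetical: how many times the word occurs in each of 8 spam messages
counts_in_spam = [0, 1, 2, 0, 1, 0, 3, 1]

def p_count_given_spam(k, counts):
    # Empirical estimate of P(W = k | spam):
    # fraction of spam messages in which the word occurs exactly k times
    return Counter(counts)[k] / len(counts)

print(p_count_given_spam(0, counts_in_spam))  # 3/8 = 0.375
print(p_count_given_spam(1, counts_in_spam))  # 3/8 = 0.375
```

As the answer notes, each value of k now needs its own estimate, so reliable probabilities require many more training samples than the binary present/absent model.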