19
votes

I'm currently trying to generate a spam filter by analyzing a corpus I've amassed.

I'm using the wikipedia entry http://en.wikipedia.org/wiki/Bayesian_spam_filtering to develop my classification code.

I've implemented code to calculate probability that a message is spam given that it contains a specific word by implementing the following formula from the wiki:

pr(S|W) = (pr(W|S)*pr(S))/(pr(W|S)*pr(S) + pr(W|H)*pr(H))

My PHP code:

public function pSpaminess($word)
{
    $ps = $this->pContentIsSpam();
    $ph = $this->pContentIsHam();
    $pws = $this->pWordInSpam($word);
    $pwh = $this->pWordInHam($word);
    $psw = ($pws * $ps) / ($pws * $ps + $pwh * $ph);
    return $psw;
}

In accordance with the Combining individual probabilities section, I've implemented code to combine the probabilities of all the unique words in a test message to determine spaminess.

From the wiki formula:

p=(p1*pn)/((p1*pn)+(1-p)(1-pn))

My PHP code:

public function predict($content)
{
    $words = $this->tokenize($content);
    $pProducts = 1;
    $pSums = 1;
    foreach($words as $word)
    {
        $p = $this->pSpaminess($word);
        echo "$word: $p\n";
        $pProducts *= $p;
        $pSums *= (1 - $p);
    }
    return $pProducts / ($pProducts + $pSums);
}

On a test string "This isn't very bad at all.", the following output is produced:

C:\projects\bayes>php test.php
this: 0.19907407407407
isn't: 0.23
very: 0.2
bad: 0.2906976744186
at: 0.17427385892116
all: 0.16098484848485
probability message is spam: float(0.00030795502523944)

Here's my question: Am I implementing the combining individual probabilities correctly? Assuming I'm generating valid individual word probabilities, is the combination method correct?

My concern is the really small resultant probability of the calculation. I've tested it on a larger test message and ended up with a resulting probability in scientific notation with more than 10 places of zeroes. I was expecting values in the 10s or 100ths places.

I'm hoping the problem lies in my PHP implementation--but when I examine the combination function from wikipedia the formula's dividend is a product of fractions. I don't see how a combination of multiple probabilities would end up being even more than .1% probability.

If it is the case, such that the longer the message the lower the probability score will be, how do I compensate the spaminess quota to correctly predict spam/ham for small and large test cases?


Additional Info

My corpus is actually a collection of about 40k reddit comments. I'm actually applying my "spam filter" against these comments. I'm rating an individual comment as spam/ham based on the number of down votes to up votes: If up votes is less than down votes it is considered Ham, otherwise Spam.

Now, because of the corpus type it turns out there are actually few words that are used in spam more so than in ham. Ie, here is a top ten list of words that appear in spam more often than ham.

+-----------+------------+-----------+
| word      | spam_count | ham_count |
+-----------+------------+-----------+
| krugman   |         30 |        27 |
| fetus     |       12.5 |       7.5 |
| boehner   |         12 |        10 |
| hatred    |       11.5 |       5.5 |
| scum      |         11 |        10 |
| reserve   |         11 |        10 |
| incapable |        8.5 |       6.5 |
| socalled  |        8.5 |       5.5 |
| jones     |        8.5 |       7.5 |
| orgasms   |        8.5 |       7.5 |
+-----------+------------+-----------+

On the contrary, most words are used in great abundance in ham more so than ham. Take for instance, my top 10 list of words with highest spam count.

+------+------------+-----------+
| word | spam_count | ham_count |
+------+------------+-----------+
| the  |       4884 |     17982 |
| to   |     4006.5 |   14658.5 |
| a    |     3770.5 |   14057.5 |
| of   |     3250.5 |   12102.5 |
| and  |       3130 |     11709 |
| is   |     3102.5 |   11032.5 |
| i    |     2987.5 |   10565.5 |
| that |     2953.5 |   10725.5 |
| it   |       2633 |      9639 |
| in   |     2593.5 |    9780.5 |
+------+------------+-----------+

As you can see, frequency of spam usage is significantly less than ham usage. In my corpus of 40k comments 2100 comments are considered spam.

As suggested below, a test phrase on a post considered spam rates as follows:

Phrase

Cops are losers in general. That's why they're cops.

Analysis:

C:\projects\bayes>php test.php
cops: 0.15833333333333
are: 0.2218958611482
losers: 0.44444444444444
in: 0.20959269435914
general: 0.19565217391304
that's: 0.22080730418068
why: 0.24539170506912
they're: 0.19264544456641
float(6.0865969793861E-5)

According to this, there is an extremely low probability that this is spam. However, if I were to now analyze a ham comment:

Phrase

Bill and TED's excellent venture?

Analysis

C:\projects\bayes>php test.php
bill: 0.19534050179211
and: 0.21093065570456
ted's: 1
excellent: 0.16091954022989
venture: 0.30434782608696
float(1)

Okay, this is interesting. I'm doing these examples as I'm composing this update so this is the first time I've seen the result for this specific test case. I think my prediction is inverted. Its actually picking out the probability of Ham instead of Spam. This deserves validation.

New test on known ham.

Phrase

Complain about $174,000 salary being too little for self.  Complain about $50,000 a year too much for teachers.
Scumbag congressman.

Analysis

C:\projects\bayes>php test.php
complain: 0.19736842105263
about: 0.21896031561847
174: 0.044117647058824
000: 0.19665809768638
salary: 0.20786516853933
being: 0.22011494252874
too: 0.21003236245955
little: 0.21134020618557
for: 0.20980452359022
self: 0.21052631578947
50: 0.19245283018868
a: 0.21149315683195
year: 0.21035386631717
much: 0.20139771283355
teachers: 0.21969696969697
scumbag: 0.22727272727273
congressman: 0.27678571428571
float(3.9604152477223E-11)

Unfortunately no. Turns out it was a coincidental result. I'm starting to wonder if perhaps comments can't be so easily quantified. Perhaps the nature of a bad comment is too vastly different than the nature of a spam message.

Perhaps it may be the case that spam filtering only works when you have a specific word class of spam messages?


Final Update

As pointed out in the replies, the weird results were due to the nature of the corpus. Using a comment corpus where there is not a an explicit definition of spam Bayesian classification does not perform. Since it is possible (and likely) that any one comment may receive both spam and ham ratings by various users it is not possible to generate a hard classification for spam comments.

Ultimately, I wanted to generate a comment classifier that could determine if a comment post would garnish karma based on a bayesian classification tuned to comment content. I may still investigate tuning the classifier to email spam messages and see if such a classifier can guess at karma response for comment systems. But for now, the question is answered. Thank you all for your input.

3
+1 for using math expressions! And code! And a full, well-written explanation. I wish I could upvote +10.wallyk
Hi Jeremy. Did you end up using this algorithm for spam filtering. I'm looking to do something similar but also getting inconsistent results.Paul Atkins
Hey Paul. I did this as an exercise, it never got used in anything. For what it is worth, I found that as mentioned below, the results matched my expectations more when I provided a corpus of equal ham/spam examples.Jeremy Giberson

3 Answers

2
votes

Varifying with only the calculator, it seems ok for the non-spam phrase you posted. In that case you have $pProducts a couple order of magnitudes smaller than $pSums.

Try running some real spam from your spam folder, where you'd meet probabilities like 0.8. And guess why spammers sometime try to send a piece of newspaper in a hidden frame along with the message :)

2
votes

If your filter is not biased (Pr(S)=Pr(H) = 0.5) then: "It is also advisable that the learned set of messages conforms to the 50% hypothesis about repartition between spam and ham, i.e. that the datasets of spam and ham are of same size."

This means that you should teach your Bayesian filter on the similar amount of spam and ham messages. Say 1000 spam messages and 1000 ham messages.

I'd assume (not checked) that if your filter is biased learning set should conform to the hypothesis about any message being spam.

0
votes

On the idea of compensating for message lengths, you could estimate for each set the probabilities of a message word being a specific word, then use a poisson distribution to estimate the probability of a message of N words containing that specific word.