
I've coded up a Naive Bayes classifier, but it doesn't seem to be working particularly well. Counting the words and so on is not a problem, but the probabilities have been.

The method I've been using starts at page 180 of this book, but I'll use the terms from the Wikipedia article to make it more universal.

Training

With training I'm creating a probability for every word occurring in a category:

for category in categories:
    for word in category_vocabulary[category]:
        word_probability[category][word] = (category_vocabulary[category][word] + 1) / (total_words_in_category[category] + len(vocabulary))

So I get the total number of times a word occurs in a category, add one, and then divide that by the total number of words in the category plus the size of the vocabulary (number of distinct words). This is P(xi|Ck).
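As a sanity check, the smoothed estimate can be computed by hand on toy counts. The counts below are invented purely for illustration:

```python
# Toy example of the add-one (Laplace) smoothed estimate P(word | category).
# All counts here are made up for illustration.
category_vocabulary = {"spam": {"buy": 3, "now": 1}}
total_words_in_category = {"spam": 4}
vocabulary = {"buy", "now", "hello"}  # 3 distinct words overall

word = "buy"
p = (category_vocabulary["spam"][word] + 1) / (
    total_words_in_category["spam"] + len(vocabulary)
)
print(p)  # (3 + 1) / (4 + 3) = 4/7 ≈ 0.571
```

Note that a word never seen in the category still gets a small nonzero probability of 1 / (4 + 3) instead of zero, which is the point of the smoothing.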

I also calculate the probability of a category, p(Ck) (category_probability), which is simply the number of words in a category divided by the number of words across all categories:

for category in categories:
    category_probability[category] = total_words_in_category[category] / sum(total_words_in_category.values())

Classifying

For classification I loop through all the tokens of the document to be classified and take the product of word_probability over all the words in the text:

    final_probability = {category: 0 for category in categories}
    for category in categories:
        for word in document_tokens:
            if word in word_probability[category]:
                if final_probability[category] == 0:
                    final_probability[category] = word_probability[category][word]
                else:
                    final_probability[category] *= word_probability[category][word]

Lastly, to calculate a score, I multiply this by the category probability:

score = category_probability[category] * final_probability[category]

This score seems to be completely wrong and I'm not sure what to do. When I've looked at other people's methods they seem to involve a few logs and exponents, but I'm not sure how those fit in with the book and the wiki article.

Any help would be much appreciated, as I imagine what I'm doing wrong is somewhat obvious to somebody who understands this better.


1 Answer


This score seems to be completely wrong and I'm not sure what to do.

First of all, the category probability p(Ck) is not estimated as the fraction of words in a category vs. the total number of words,

for category in categories:
    category_probability[category] = total_words_in_category[category] / sum(total_words_in_category.values())

but as the number of sentences in a category vs. the total number of sentences (or paragraphs, documents, objects, whatever it is you are classifying). Thus:

for category in categories:
    category_probability[category] = total_objects_in_category[category] / sum(total_objects_in_category.values())

When I've looked up other peoples methods they seem to involve a few logs and exponents but I'm not sure how they fit in with the book and wiki article.

This is because direct probability computation (which is what you are doing) is numerically unstable: you end up multiplying lots of tiny numbers, so precision falls exponentially. Consequently one uses this simple mathematical identity:

PROD_i P_i(x) = exp[ log( PROD_i P_i(x) ) ] = exp[ SUM_i log P_i(x) ]

Thus instead of storing probabilities you store logarithms of probabilities, and instead of multiplying them you sum them. If you want to recover the true probability all you have to do is exponentiate, but for classification you don't even need to, since log is monotonically increasing: P(x) > P(y) <=> log P(x) > log P(y).
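A minimal sketch of what the scoring step looks like in log space, assuming word_probability and category_probability have already been filled in as in the question (the numbers below are made up):

```python
import math

# Hypothetical smoothed word probabilities and priors, invented for
# illustration; in practice these come from the training step.
word_probability = {"spam": {"buy": 0.4, "now": 0.3},
                    "ham": {"buy": 0.1, "now": 0.2}}
category_probability = {"spam": 0.5, "ham": 0.5}
document_tokens = ["buy", "now"]

log_score = {}
for category in category_probability:
    # Start from the log prior, then ADD log-likelihoods instead of
    # multiplying raw probabilities.
    s = math.log(category_probability[category])
    for word in document_tokens:
        if word in word_probability[category]:
            s += math.log(word_probability[category][word])
    log_score[category] = s

# argmax over log scores equals argmax over probabilities.
best = max(log_score, key=log_score.get)
print(best)  # "spam", since 0.5*0.4*0.3 > 0.5*0.1*0.2
```

Summing logs never underflows the way a product of thousands of tiny probabilities does, which is why virtually every Naive Bayes implementation works this way.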