I've coded up a Naive Bayes Classifier, but it doesn't seem to be working particularly well. Counting the words etc. is not a problem but the probabilities have been.
The method I've been using starts at page 180 in this book But I'll use the terms from the wiki article to make it more universal.
Training
With training I'm creating a probability for every word occurring in a category:
for category in categories:
for word in category_vocabulary[category]:
word_probability[category][word] = (category_vocabulary[category][word] + 1) / (total_words_in_category[category] + len(vocabulary))
So I get the total of number of times a word occurs in a category, add one, and then divide that by total words in a category, plus the size of the vocabulary (distinct words). This is P(xi|Ck)
I also calculate the probability of a category p(Ck), category_probability, which is simply the amount of words in a category divided by the words in all categories
for category in categories:
category_probability[category] = total_words_in_category[category] / sum(total_words_in_category.values())
Classifying
For classification I loop through all the tokens of the document to be classified, and calculate the product of word_probability for all the words in the text.
for category in categories:
if word in word_probability[category]:
if final_probability[category] == 0:
final_probability[category] = word_probability[category][word]
else:
final_probability[category] *= word_probability[category][word]
Lastly to calculate a score I multiply this by the category probability
score = category_probability[category] * final_probability[category]
This score seems to be completely wrong and I'm not sure what to do. When I've looked up other peoples methods they seem to involve a few logs and exponents but I'm not sure how they fit in with the book and wiki article.
Any help would be much appreciated as I imagine what I'm doing wrong is somewhat obvious to somebody that better understands it.