
I have written a simple multinomial Naive Bayes classifier in Python. The code predicts the correct labels for the BBC news dataset, but when I use the prior probability P(x) in the denominator to output the scores as probabilities, I get incorrect values (like probabilities > 1). Below I attach my code:

The entire process is based on this formula I learnt from the Wikipedia article about Naive Bayes:

p(Ck | x) = p(Ck) * p(x | Ck) / p(x)

1. So, the first step is to extract features from the articles. I use scikit-learn's CountVectorizer for this purpose. It counts the number of occurrences of every vocabulary word in each article:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english', min_df=5, ngram_range=(1,1))
features = vectorizer.fit_transform(data.news).toarray()
print(features.shape)
(2225, 9138)

As a result, I get 9138 features for each article in the dataset.
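
For the later steps it helps to note that a custom text can be mapped into the same feature space with the fitted vectorizer. A minimal sketch (the example text is made up; only transform is used, so the vocabulary learned from data.news is reused):

# Map a new text into the same 9138-dimensional count space as the training articles
custom_text = ["The central bank raised interest rates again this quarter."]
article = vectorizer.transform(custom_text).toarray()[0]
print(article.shape)   # (9138,)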

  2. The next step is to calculate p(xi | Ck) for each label. It is given by the multinomial distribution formula:

p(x | Ck) = ( (Σi xi)! / Πi xi! ) * Πi pki^xi

I calculate pki as follows:

import numpy as np

def count_word_probability(features):
  # features: count matrix for the articles of a single label
  V_size = features.shape[1]
  alpha = 1  # Laplace smoothing
  total_counts_for_each_word = np.sum(features, axis=0)
  total_count_of_words = np.sum(total_counts_for_each_word)
  probs = (alpha + total_counts_for_each_word) / ((V_size * alpha) + total_count_of_words)
  return probs

Basically, this function computes the total frequency of each word across all articles with a particular label (e.g. business) and divides it by the total number of words in those articles. It also applies Laplace smoothing (alpha = 1) to account for words with zero frequency.
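
For completeness, the per-class probabilities can be built roughly like this (a sketch; it assumes the category_id column with values 0-4 that is also used for the priors below):

# Sketch: p(xi | Ck) for each of the 5 labels, using only that label's articles
word_probs = [count_word_probability(features[(data['category_id'] == k).values])
              for k in range(5)]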

  3. Next, I compute p(Ck), the prior probability for each label. I simply divide the number of articles in one category by the total number of articles in all categories:
labels_probs = [ len(data.index[data['category_id'] == i ]) / len(data) for i in range(5)]
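
As a quick sanity check (not part of the original pipeline), the class priors should form a distribution and sum to 1:

# The priors over the 5 categories should sum to 1
assert abs(sum(labels_probs) - 1.0) < 1e-9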
  4. These are the functions for the scaling term and the constant term (P(x)) respectively:
import math
from scipy.special import factorial

def scaling_term(doc):
  # (total word count)! / product of (per-word count)!
  term = math.factorial(np.sum(doc)) / np.prod(factorial(doc))
  return term

The scaling function above divides the factorial of the total word count in an article by the product of the factorials of the individual word counts. (A log-space alternative is sketched at the end of this step.)

def nb_constant (article, labels_probs, word_probs):
  s_term = scaling_term(article)
  evidence = [ np.log(s_term)  + np.sum(article * np.log(word_probs[i])) + np.log(labels_probs[i])  for i in range(len(word_probs))]
  evidence = np.sum(evidence)
  return evidence

So, the last function above calculates the denominator (the evidence P(x)). It sums up P(x|Ck) * P(Ck) over all article classes:

p(x) = Σk p(Ck) * p(x | Ck)
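
Since everything downstream works in log space anyway, the huge factorials in scaling_term can be avoided by computing its logarithm directly with scipy's gammaln. This is just an alternative sketch, not part of the original code:

from scipy.special import gammaln

def log_scaling_term(doc):
  # log(n!) = gammaln(n + 1), so this returns the log of the multinomial coefficient
  return gammaln(np.sum(doc) + 1) - np.sum(gammaln(doc + 1))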

  5. And the final Naive Bayes classifier looks like this:
def naive_bayes(article, label_probs, words_probs):
  class_probs = []
  s_term = scaling_term(article)
  constant_term = nb_constant(article, label_probs, words_probs)
  for cl in range(len(label_probs)):
    class_prob =  ( np.log(s_term) + np.sum(article * np.log(words_probs[cl])) + np.log(label_probs[cl]) )  / constant_term
    class_probs.append(class_prob)
  class_probs = np.exp(np.array(class_probs))
  return class_probs

Without the constant term, this function outputs the correct label for any custom text I feed to it. But the scores are all uniform and close to zero for all classes. When I divide by the constant term to get real probability values that sum to one, I get weird results like a probability of 1.25 for every class. I am definitely missing something in theory, because I don't know much about probability theory and math. I would appreciate any help. Thanks.

Well, if the final per-class probabilities don't sum to 1, it means you've calculated the normalization factor incorrectly, since by definition 1/Z is the factor that makes the per-class probabilities sum to 1. The normalization should look like: Z = sum of the non-normalized probabilities over k, then normalized probabilities = non-normalized / Z. It looks to me like you're on the right track, hang in there, I think you can figure it out. – Robert Dodier
@RobertDodier Hi, Robert! Thanks for your response. Could you please explain this formula a little bit? What are the non-normalized probabilities over k and what are the normalized probabilities? I thought I should just use the same formula as in the numerator, P(Ck) * p(x|Ck), but sum it up over all classes. – kirgol
It looks to me like you are taking logarithms to change multiplication into addition, which is OK, but you have to be careful. You have to apply 1/Z after converting back from log(p) to p, i.e., after taking exp. As for computing Z, the simplest and most reliable way is to just sum over the array you want to normalize, adding up the elements as they are, and then divide each element by the sum. My advice is: don't try to reproduce the formula and sum over the formula; just construct the array and then sum over the numbers in the array. Hope this helps! – Robert Dodier
@RobertDodier thank you very much! It worked. I had to sum over the classes to get Z after taking the exponent of each class term. That was the first mistake. The second mistake was that I had to divide by Z after the exponents of the numerator were taken. Could you explain why this order? Is it because I cannot divide logs if the logs of the numerator and denominator are taken separately? Or could it work with subtraction, log(numerator) - log(denominator)? – kirgol
Also, if you want, you could post your answer to the question, pointing to some logarithm rules and explaining how to be careful when using this formula. – kirgol

1 Answer


Thanks to Robert Dodier I figured out what the issue was. Before dividing by the constant (evidence) term, make sure to exponentiate the numerator logs back to probabilities. Also, make sure to exponentiate each per-class term in the evidence before summing them up.
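
Concretely, this is a minimal sketch of the corrected classifier, using the same helper functions as in the question: the per-class log scores are exponentiated first, and only then normalized by their sum (the evidence Z).

import numpy as np

def naive_bayes(article, label_probs, word_probs):
  # log p(Ck) + log p(x | Ck) for every class
  log_scores = np.array([
      np.log(scaling_term(article))
      + np.sum(article * np.log(word_probs[cl]))
      + np.log(label_probs[cl])
      for cl in range(len(label_probs))
  ])
  scores = np.exp(log_scores)      # back from log space first
  return scores / np.sum(scores)   # then divide by Z so the result sums to 1

For long documents the raw exponentials can underflow; subtracting max(log_scores) before taking np.exp (the log-sum-exp trick) gives the same normalized result more stably.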