I have written a simple multinomial Naive Bayes classifier in Python. The code predicts the correct labels for the BBC news dataset, but when I divide by the P(X) term in the denominator to output the scores as probabilities, I get incorrect values (like a probability > 1). My code is attached below:
The entire process is based on this formula I learnt from the Wikipedia article about Naive Bayes:
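In plain notation, that formula is:

p(Ck | x) = p(Ck) * p(x | Ck) / p(x)

where p(x) in the denominator is the term I want to divide by to turn the scores into probabilities.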
- So, the first step is to extract features from the articles. I use sklearn's CountVectorizer for this purpose. It counts the number of occurrences of every word in the vocabulary:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english', min_df=5, ngram_range=(1, 1))
features = vectorizer.fit_transform(data.news).toarray()
print(features.shape)
(2225, 9138)
As a result, I get 9138 features for each article in the dataset.
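For the custom texts I feed to the classifier later, I vectorize them with the same fitted vectorizer, roughly like this (the text is just a placeholder):

custom_article = vectorizer.transform(["some custom news text"]).toarray()[0]
print(custom_article.shape)
(9138,)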
- The next step is to calculate p(xi | Ck), the probability of each word for each label. These probabilities enter the multinomial distribution formula for p(x | Ck):
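In plain notation:

p(x | Ck) = ( (sum_i xi)! / prod_i xi! ) * prod_i pki^xi

where xi is the count of word i in the article and pki is the probability of word i given class Ck.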
I calculate pki as follows:
import numpy as np

def count_word_probability(features):
    # features: the count matrix for the articles of one label
    V_size = features.shape[1]
    alpha = 1
    total_counts_for_each_word = np.sum(features, axis=0)
    total_count_of_words = np.sum(total_counts_for_each_word)
    probs = (alpha + total_counts_for_each_word) / ((V_size * alpha) + total_count_of_words)
    return probs
Basically, this function computes the total frequency of each word across all articles with a particular label (e.g. business) and divides it by the total number of words in all articles with that label. It also applies Laplace smoothing (alpha = 1) to account for words with zero frequency.
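To get the per-class word probabilities, I call this function on the rows that belong to each label, roughly like this:

word_probs = [count_word_probability(features[(data['category_id'] == i).values]) for i in range(5)]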
- Next, I compute p(Ck), the prior probability for each label. I simply divide the number of articles in one category by the total number of articles in all categories:
labels_probs = [len(data.index[data['category_id'] == i]) / len(data) for i in range(5)]
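As a sanity check, these priors sum to one:

print(sum(labels_probs))  # 1.0 (up to floating-point rounding)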
- These are the functions for the scaling term and the constant term P(x), respectively:
import math
from scipy.special import factorial

def scaling_term(doc):
    # multinomial coefficient: (sum of word counts)! / product of per-word count factorials
    term = math.factorial(np.sum(doc)) / np.prod(factorial(doc))
    return term
The scaling function above divides the factorial of the sum of word counts in an article by the product of the factorials of the individual word counts.
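A quick check on a tiny example:

print(scaling_term(np.array([2, 1, 0])))  # 3! / (2! * 1! * 0!) = 3.0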
def nb_constant(article, labels_probs, word_probs):
    s_term = scaling_term(article)
    # per-class terms: log scaling term + log likelihood + log prior, summed over all classes
    evidence = [np.log(s_term) + np.sum(article * np.log(word_probs[i])) + np.log(labels_probs[i]) for i in range(len(word_probs))]
    evidence = np.sum(evidence)
    return evidence
So, the last function above calculates the denominator (the evidence P(x)). It sums up P(x|Ck) * P(Ck) over all article classes.
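As far as I understand, that denominator is:

P(x) = sum_k P(Ck) * P(x | Ck)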
- And the final Naive Bayes classifier looks like this:
def naive_bayes(article, label_probs, words_probs):
    class_probs = []
    s_term = scaling_term(article)
    constant_term = nb_constant(article, label_probs, words_probs)
    for cl in range(len(label_probs)):
        # log of the numerator for class cl, divided by the constant term from nb_constant
        class_prob = (np.log(s_term) + np.sum(article * np.log(words_probs[cl])) + np.log(label_probs[cl])) / constant_term
        class_probs.append(class_prob)
    class_probs = np.exp(np.array(class_probs))
    return class_probs
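This is roughly how I call it on a custom text (the text is just a placeholder, and word_probs is the per-class list from above):

article = vectorizer.transform(["some custom news text"]).toarray()[0]
print(naive_bayes(article, labels_probs, word_probs))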
Without the constant term, this function outputs the correct label for any custom texts I feed to it, but the scores are all uniform and close to zero for all classes. When I divide by the constant term to get real probability values that sum to one, I get weird results, like a probability of 1.25 for all classes. I am definitely missing something in theory, because I don't know much about probability theory and math. I would appreciate any help. Thanks.