Complication using log-probabilities - Naive Bayes text classifier

Question

I'm constructing a Naive Bayes text classifier from scratch in Python and I am aware that, upon encountering a product of very small probabilities, using a logarithm over the probabilities is a good choice.

The issue now, is that the mathematical function that I'm using has a summation OVER a product of these extremely small probabilities.

To be specific, I'm trying to calculate the total word probabilities given a mixture component (class) over all classes.

Just plainly adding up the logs of these total probabilities is incorrect, since the log of a sum is not equal to the sum of logs.

To give an example, lets say that I have 3 classes, 2000 words and 50 documents. Then I have a word probability matrix called wordprob with 2000 rows and 3 columns.

The algorithm for the total word probability in this example would look like this:

sum = 0
for j in range(0,3):
    prob_product = 1
    for i in words:  #just the index of words from my vocabulary in this document
        prob_product = prob_product*wordprob[i,j]
    sum = sum + prob_product

What ends up happening is that prob_product becomes 0 on many iterations due to many small probabilities multiplying with each other.

Since I can't easily solve this with logs (because of the summation in front) I'm totally clueless.

Any help will be much appreciated.

dmuir dmuir · Accepted Answer · 2016-05-20T14:10:19

I think you may be best to keep everything in logs. The first part of this, to compute the log of the product is just adding up the log of the terms. The second bit, computing the log of the sum of the exponentials of the logs is a bit trickier.

One way would be to store each of the logs of the products in an array, and then you need a function that, given an array L with n elements, will compute

S = log( sum { i=1..n | exp( L[i])})

One way to do this is to find the maximum, M say, of the L's; a little algebra shows

S = M + log( sum { i=1..n | exp( L[i]-M)})

Each of the terms L[i]-M is non-positive so overflow can't occur. Underflow is not a problem as for them exp will return 0. At least one of them (the one where L[i] is M) will be zero so it's exp will be one and we'll end up with something we can pass to log. In other words the evaluation of the formula will be trouble free.

If you have the function log1p (log1p(x) = log(1+x)) then you could gain some accuracy by omitting the (just one!) i where L[i] == M from the sum, and passing the sum to log1p instead of log.

Complication using log-probabilities - Naive Bayes text classifier

3 Answers