2
votes

I'm constructing a Naive Bayes text classifier from scratch in Python and I am aware that, upon encountering a product of very small probabilities, using a logarithm over the probabilities is a good choice.

The issue now, is that the mathematical function that I'm using has a summation OVER a product of these extremely small probabilities.

To be specific, I'm trying to calculate the total word probabilities given a mixture component (class) over all classes.

Just plainly adding up the logs of these total probabilities is incorrect, since the log of a sum is not equal to the sum of logs.

To give an example, lets say that I have 3 classes, 2000 words and 50 documents. Then I have a word probability matrix called wordprob with 2000 rows and 3 columns.

The algorithm for the total word probability in this example would look like this:

sum = 0
for j in range(0,3):
    prob_product = 1
    for i in words:  #just the index of words from my vocabulary in this document
        prob_product = prob_product*wordprob[i,j]
    sum = sum + prob_product

What ends up happening is that prob_product becomes 0 on many iterations due to many small probabilities multiplying with each other.

Since I can't easily solve this with logs (because of the summation in front) I'm totally clueless.

Any help will be much appreciated.

3
NumPy has a logaddexp function for exactly this purpose.Mark Dickinson
(And scipy.misc.logsumexp may also be of interest.)Mark Dickinson

3 Answers

3
votes

I think you may be best to keep everything in logs. The first part of this, to compute the log of the product is just adding up the log of the terms. The second bit, computing the log of the sum of the exponentials of the logs is a bit trickier.

One way would be to store each of the logs of the products in an array, and then you need a function that, given an array L with n elements, will compute

S = log( sum { i=1..n | exp( L[i])})

One way to do this is to find the maximum, M say, of the L's; a little algebra shows

S = M + log( sum { i=1..n | exp( L[i]-M)})

Each of the terms L[i]-M is non-positive so overflow can't occur. Underflow is not a problem as for them exp will return 0. At least one of them (the one where L[i] is M) will be zero so it's exp will be one and we'll end up with something we can pass to log. In other words the evaluation of the formula will be trouble free.

If you have the function log1p (log1p(x) = log(1+x)) then you could gain some accuracy by omitting the (just one!) i where L[i] == M from the sum, and passing the sum to log1p instead of log.

1
votes

your question seems on the math side of things rather than the coding of it. I haven't quite figured out what your issue is but the sum of logs equals the log of the products. Dont know if that helps.. Also, you are calculating one prob_product for every j but you are just using the last one (and you are re-initializing it). you meant to do one of two things: either initialize it before the j-loop or use it before you increment j. Finally, i doesnt look that you need to initialize sum unless this is part of yet another loop you are not showing here.

That's all i have for now. Sorry for the long post and no code.

0
votes

High school algebra tells you this:

log(A*B*....*Z) = log(A) + log(B) + ... + log(Z) != log(A + B + .... + Z)