I have two matrices A and B, each with a size of NxM, where N is the number of samples and M is the size of histogram bins. Thus, each row represents a histogram for that particular sample.

What I would like to do is compute the chi-square distance between every pair of samples across the two matrices. Each row of A is compared to every row of B, producing a final matrix C of size NxN, where C[i,j] is the chi-square distance between the histograms A[i] and B[j].

Here is my python code that does the job:

import numpy as np

def chi_square(histA, histB):
    eps = 1.e-10  # guards against division by zero for bins empty in both histograms
    d = np.sum((histA - histB)**2 / (histA + histB + eps))
    return 0.5 * d

def matrix_cost(A, B):
    a, _ = A.shape
    b, _ = B.shape
    C = np.zeros((a, b))
    for i in range(a):
        for j in range(b):
            C[i, j] = chi_square(A[i], B[j])
    return C

Currently, for two 100x70 matrices, this entire process takes about 0.1 seconds.

Is there any way to improve this performance?

I would appreciate any thoughts or recommendations.

Thank you.

1 Answer

Sure! I'm assuming you're using numpy?

If you have the RAM available, you could broadcast the arrays and let NumPy's efficient vectorized operations do the work on them.

Here's how:

eps = 1.e-10  # same guard as in your chi_square, for bins empty in both histograms
Abroad = A[:, np.newaxis, :]  # shape (N, 1, M), prepared for broadcasting against B's (N, M)
C = np.sum((Abroad - B)**2 / (Abroad + B + eps), axis=-1) / 2.
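To convince yourself the broadcast version matches the double loop, a small self-contained check (the helper name `chi_square_loop` is just illustrative) can compare the two on random data:

```python
import numpy as np

def chi_square_loop(A, B, eps=1.e-10):
    # reference double-loop implementation, as in the question
    C = np.zeros((A.shape[0], B.shape[0]))
    for i in range(A.shape[0]):
        for j in range(B.shape[0]):
            C[i, j] = 0.5 * np.sum((A[i] - B[j])**2 / (A[i] + B[j] + eps))
    return C

rng = np.random.RandomState(0)
A = rng.rand(100, 70)
B = rng.rand(100, 70)

eps = 1.e-10
Abroad = A[:, np.newaxis, :]                 # (100, 1, 70)
C_fast = np.sum((Abroad - B)**2 / (Abroad + B + eps), axis=-1) / 2.

print(np.allclose(C_fast, chi_square_loop(A, B)))  # True
```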

Timing on my machine shows roughly a factor-of-10 speed gain compared to your algorithm.

A slower option (but still faster than your original algorithm) that uses less RAM is to broadcast one row of A at a time against all of B:

def new_way(A, B, eps=1.e-10):
    C = np.empty((A.shape[0], B.shape[0]))
    for rowind, row in enumerate(A):
        # row has shape (M,) and broadcasts against all of B at once, shape (N, M)
        C[rowind, :] = np.sum((row - B)**2 / (row + B + eps), axis=-1) / 2.
    return C

This has the advantage that it can be run for arrays with shape (N,M) much larger than (100,70).
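As a back-of-envelope sketch of why: the fully broadcast version materializes an intermediate array of shape (N, N, M), while the row-wise version only ever holds one (N, M) intermediate at a time. For N=1000, M=70 in float64:

```python
N, M = 1000, 70
bytes_full = N * N * M * 8     # (N, N, M) float64 intermediate for full broadcasting
bytes_rowwise = N * M * 8      # (N, M) float64 intermediate, reused for each row
print(bytes_full / 1e9, "GB vs", bytes_rowwise / 1e6, "MB")  # factor of N apart
```

So the full broadcast needs on the order of half a gigabyte just for the intermediate, while the row-wise loop stays well under a megabyte.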

You could also look to Theano to push the expensive for-loop down to the C level if you don't have the memory available. I get a factor-of-2 speed gain compared to the first option (not taking into account the initial compile time) for both (100,70) and (1000,70) arrays:

import theano
import theano.tensor as T

X = T.matrix("X")
Y = T.matrix("Y")
# scan iterates over the rows of X, broadcasting each row against all of Y,
# much like new_way above (add an eps guard if your histograms can have empty bins)
results, updates = theano.scan(
    lambda x_i: ((x_i - Y)**2 / (x_i + Y)).sum(axis=1) / 2.,
    sequences=X)
chi_square_norm = theano.function(inputs=[X, Y], outputs=[results])
chi_square_norm(A, B)  # same result as the NumPy versions