5
votes

OK I have recently discovered that the the scipy.spatial.distance.cdist command is very quick for solving a COMPLETE distance matrix between two vector arrays for source and destination. see: How can the euclidean distance be calculated with numpy? I wanted to try to duplicate those performance gains when solving the distance between two equal sized arrays. The distance between two SINGLE vectors is rather straight forward to calculate as shown in the previous link. We can take vectors:

    import numpy as np
    A=np.random.normal(size=(3))
    B=np.random.normal(size=(3))

and then use ´numpy.linalg.norm´ where

    np.linalg.norm(A-B)

is equivalent to

    temp = A-B
    np.sqrt(temp[0]**2+temp[1]**2+temp[2]**2)

which works nicely however when I want to know the distance between two sets of vectors where my_distance = distance_between( A[i], B[i] ) for all i the second solution works perfectly. In that as expected:

    A=np.random.normal(size=(3,42))
    B=np.random.normal(size=(3,42))     
    temp = A-B
    np.sqrt(temp[0]**2+temp[1]**2+temp[2]**2)

gives me a set of 42 distances between the ith element of A to the ith element of B. Whereas the norm function correctly calculates the norm for the entire matrix giving me a single value that is not what I'm looking for. The behaviour with the 42 distances is what I want to maintain, hopefully with nearly as much speed as I get from cdist for solving complete matrices. So the question is whats the most efficient way using python and numpy/scipy to calculate i distances between data with shape (n,i)?

Thanks, Sloan

2

2 Answers

3
votes

I think you already cracked most of the case yourself. Instead of your last line, however, I would use:

np.sqrt(np.sum(temp**2,0))
0
votes

Here are the timed comparisons for the two methods I think are most appropriate:

import timeit
In[19]:    timeit.timeit(stmt='np.linalg.norm(x-y,axis=0)', setup='import numpy as np; x,y = np.random.normal(size=(10, 100)), np.random.normal(size=(10, 100))', number=1000000)
Out[19]:   15.132534857024439

In[20]:    timeit.timeit(stmt='np.sqrt(np.sum((x-y),axis=1))', setup='import numpy as np; x,y = np.random.normal(size=(10, 100)), np.random.normal(size=(10, 100))', number=1000000)
Out[20]:   9.417887529009022

I'm not surprised the numpy method works faster. I believe as python improves, a lot of these builtin functions will be improved.

Tests were performed on anaconda python version 3.5.2