OK I have recently discovered that the the scipy.spatial.distance.cdist
command is very quick for solving a COMPLETE distance matrix between two vector arrays for source and destination.
see: How can the euclidean distance be calculated with numpy?
I wanted to try to duplicate those performance gains when solving the distance between two equal sized arrays. The distance between two SINGLE vectors is rather straight forward to calculate as shown in the previous link. We can take vectors:
import numpy as np
A=np.random.normal(size=(3))
B=np.random.normal(size=(3))
and then use ´numpy.linalg.norm´ where
np.linalg.norm(A-B)
is equivalent to
temp = A-B
np.sqrt(temp[0]**2+temp[1]**2+temp[2]**2)
which works nicely however when I want to know the distance between two sets of vectors where my_distance = distance_between( A[i], B[i] ) for all i
the second solution works perfectly. In that as expected:
A=np.random.normal(size=(3,42))
B=np.random.normal(size=(3,42))
temp = A-B
np.sqrt(temp[0]**2+temp[1]**2+temp[2]**2)
gives me a set of 42 distances between the i
th element of A
to the i
th element of B
. Whereas the norm
function correctly calculates the norm for the entire matrix giving me a single value that is not what I'm looking for.
The behaviour with the 42 distances is what I want to maintain, hopefully with nearly as much speed as I get from cdist
for solving complete matrices. So the question is whats the most efficient way using python and numpy/scipy to calculate i
distances between data with shape (n,i)
?
Thanks, Sloan