1
votes

I have a large array (~20k entries) of two dimension data, and I want to calculate the pairwise Euclidean distance between all entries. I need the output to have standard square form. Multiple solutions for this problem have been proposed, but none of them seem to work efficiently for large arrays.

The method using complex transposing fails for large arrays.

Scipy pdist seems to be the most efficient method using numpy. However, using squareform on the result to obtain a square matrix makes it very inefficient.

So the best I could come up with is using Scipy cdist, which is somewhat awkward, as it does calculate every pairwise distance twice. The provided time measurements show the advantage of pdist for the raw distance calculation.

Complex: 49.605 s

Cdist: 4.820 s

Pdist 1.785 s

Pdist with squareform 10.212 s

3
What is efficient in your mind? Do you have a specific time requirement you need to hit? You could probably beat cdist with a numba UDF specifically to calculate euclidean distance, but only by a little bit. If you need this to be faster, you may want to look at using a GPU. From Python, you could probably do this quite well with numba.cuda or CuPy.Nick Becker
Actually, if you just need the upper or lower triangle, you could probably beat cdist by close to 2x with a numba UDF.Nick Becker
What do you do next with the data? Can you reuse the output array? (memory allocation is also very costly). The only costly thing in this calculation is the square root...max9111
Efficient in my mind is anything that comes a bit closer to a Rust implementation of this problem, which is a bit faster than pdist without applying squareform afterwards. The data is actually geospatial (a flat projection allows to use Euclidean distance). I'm running a dynamic programming algorithm on the data to find n points which maximise the sum of the straight distance between them, so I do need to access many indices.Moritz
I have noticed that memory allocation at this kind of data size is very costly, but after calculating the distance matrix, I do not need to move the data anymore. As I will be running the program on a server, so I'm not really sure I it is worthwhile to optimise the code for a GPU, but it would probably speed it up quite a bit.Moritz

3 Answers

0
votes

Since you've implied you don't need the full square matrix of results by noting that cdist is awkward because it calculates pairwise distances twice, you could use Numba to write a UDF that only calculates for the lower or upper triangle of the square matrix.

Note that the first time this is run there is overhead from the JIT compilation.

from scipy.spatial import distance
import pandas as pd
from numba import njit, prange
import numpy as np

@njit(parallel=True)
def euclidean_distance(coords1, coords2):
    # allocate output array
    c1_length, c2_length = len(coords1), len(coords2)
    out = np.empty(shape=(c1_length, c2_length), dtype=np.float64)

    # fill the lower triangle with euclidean distance formula
    # assuming coordiantes are (lat, lon) based on the example https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
    for lat_ix in prange(c1_length):
        for lon_ix in prange(c2_length):
            if lat_ix >= lon_ix: # do the reverse for the upper triangle
                out[lat_ix, lon_ix] = (
                    (coords1[lat_ix, 0] - coords2[lon_ix, 0]) ** 2
                    + (coords1[lat_ix, 1] - coords2[lon_ix, 1]) ** 2
                ) ** 0.5
            else:
                out[lat_ix, lon_ix] = 0
    return out


for n in [10, 100, 5000, 20000]:
    arr = np.random.normal(0, 100, (n, 2))
    print(n, arr.shape)

    %time out = euclidean_distance(arr, arr)
    %time out_cdist = distance.cdist(arr, arr, 'euclidean')

    if n < 1000:
        np.testing.assert_array_almost_equal(out, np.tril(out_cdist))
    print()

Output:

10 (10, 2)
CPU times: user 987 ms, sys: 19.3 ms, total: 1.01 s
Wall time: 1.01 s
CPU times: user 79 µs, sys: 12 µs, total: 91 µs
Wall time: 95.1 µs

100 (100, 2)
CPU times: user 1.05 ms, sys: 404 µs, total: 1.45 ms
Wall time: 1.16 ms
CPU times: user 926 µs, sys: 254 µs, total: 1.18 ms
Wall time: 946 µs

5000 (5000, 2)
CPU times: user 125 ms, sys: 128 ms, total: 253 ms
Wall time: 75 ms
CPU times: user 184 ms, sys: 92.6 ms, total: 277 ms
Wall time: 287 ms

20000 (20000, 2)
CPU times: user 2.21 s, sys: 2.15 s, total: 4.36 s
Wall time: 2.55 s
CPU times: user 3.1 s, sys: 2.71 s, total: 5.81 s
Wall time: 31.9 s

With a 20,000 element array, the UDF is quite a bit faster since it can save half of the computation. cdist seems particularly/unexpectedly slow for this specific distribution of data at scale on my Macbook Air, but the point is made regardless.

0
votes

The memory bandwidth is the limiting part in this problem

At first try simple some simple memory operations to get some reference timings.

import numba as nb
import numpy as np
from scipy.spatial import distance

#Should be at least 0.47 (SVML-Bug)
print(nb.__version__)

@nb.njit(fastmath=True,parallel=True)
def dist_simply_write(res):
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[0]):
            res[i,j]=1.
    return res

res_1=np.empty((A.shape[0],A.shape[0]))
res_2=np.empty((A.shape[0],A.shape[0]))

#Copying the array to a new array, which has to be allocated
%timeit res_2=np.copy(res_1)
#1.32 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#Copying the array to a new array, which is already allocated
%timeit np.copyto(res_1,res_2)
#328 ms ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#fill an array with 1., without calculating anything
%timeit out=dist_simply_write(A,res)
#246 ms ± 707 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Does it take longer to calculate the euclidian distance instead of writing 1.?

@nb.njit(fastmath=True,parallel=True)
def dist_arr_1(A):
    res=np.empty((A.shape[0],A.shape[0]))
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[0]):
            acc=0
            for k in range(A.shape[1]):
                acc+=(A[i,k]-A[j,k])**2
            res[i,j]=np.sqrt(acc)
    return res

@nb.njit(fastmath=True,parallel=True)
def dist_arr_2(A,res):
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[0]):
            acc=0
            for k in range(A.shape[1]):
                acc+=(A[i,k]-A[j,k])**2
            res[i,j]=np.sqrt(acc)
    return res

%timeit out=dist_arr_1(A)
#559 ms ± 85.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
res=np.empty((A.shape[0],A.shape[0]))

#If we can reuse the output memory
%timeit out=dist_arr_2(A,res)
#238 ms ± 4.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As you could see, it doesn't matter at all if we do a simple calculation (euclidean distance) or writing just a Number to the array. Calculating only half of the values and copying them afterwards is actually slower (no contiguous iteration in memory and reloading data).

0
votes

I tried both numpy broadcasting and scipy.spatial.distance.cdist and both seem to be similar when it comes to time efficiency:

import numpy as np
from scipy.spatial.distance import cdist
import time

def dist_numpy(a, b):
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    d = np.transpose(d)
    sorted_d = np.sort(d)
    sorted_ind = np.argsort(d)
    return sorted_d, sorted_ind

def dist_scipy(a, b):
    d = cdist(a, b, 'euclidean')
    d = np.transpose(d)
    sorted_d = np.sort(d)
    sorted_ind = np.argsort(d)
    return sorted_d, sorted_ind

def get_a_b(r=10**4,c=10** 1):
    a = np.random.uniform(-1, 1, (r, c)).astype('f')
    b = np.random.uniform(-1, 1, (r, c)).astype('f')
    return a,b

if __name__ == "__main__":
    a, b = get_a_b()
    st_t = time.time()
    #dist_numpy(a,b) # comment/ uncomment to execute the code! 
    dist_scipy(a,b) # comment/ uncomment to execute the code!
    print('it took {} s'.format(time.time()-st_t))