Efficiently Calculating a Euclidean Dist Matrix in Numpy?

Question

I have a large array (~20k entries) of two dimension data, and I want to calculate the pairwise Euclidean distance between all entries. I need the output to have standard square form. Multiple solutions for this problem have been proposed, but none of them seem to work efficiently for large arrays.

The method using complex transposing fails for large arrays.

Scipy pdist seems to be the most efficient method using numpy. However, using squareform on the result to obtain a square matrix makes it very inefficient.

So the best I could come up with is using Scipy cdist, which is somewhat awkward, as it does calculate every pairwise distance twice. The provided time measurements show the advantage of pdist for the raw distance calculation.

Complex: 49.605 s

Cdist: 4.820 s

Pdist 1.785 s

Pdist with squareform 10.212 s

What is efficient in your mind? Do you have a specific time requirement you need to hit? You could probably beat cdist with a numba UDF specifically to calculate euclidean distance, but only by a little bit. If you need this to be faster, you may want to look at using a GPU. From Python, you could probably do this quite well with numba.cuda or CuPy. — Nick Becker
Actually, if you just need the upper or lower triangle, you could probably beat cdist by close to 2x with a numba UDF. — Nick Becker
What do you do next with the data? Can you reuse the output array? (memory allocation is also very costly). The only costly thing in this calculation is the square root... — max9111
Efficient in my mind is anything that comes a bit closer to a Rust implementation of this problem, which is a bit faster than pdist without applying squareform afterwards. The data is actually geospatial (a flat projection allows to use Euclidean distance). I'm running a dynamic programming algorithm on the data to find n points which maximise the sum of the straight distance between them, so I do need to access many indices. — Moritz
I have noticed that memory allocation at this kind of data size is very costly, but after calculating the distance matrix, I do not need to move the data anymore. As I will be running the program on a server, so I'm not really sure I it is worthwhile to optimise the code for a GPU, but it would probably speed it up quite a bit. — Moritz

Nick Becker Nick Becker · Accepted Answer · 2020-03-25T03:19:09

Since you've implied you don't need the full square matrix of results by noting that cdist is awkward because it calculates pairwise distances twice, you could use Numba to write a UDF that only calculates for the lower or upper triangle of the square matrix.

Note that the first time this is run there is overhead from the JIT compilation.

from scipy.spatial import distance
import pandas as pd
from numba import njit, prange
import numpy as np

@njit(parallel=True)
def euclidean_distance(coords1, coords2):
    # allocate output array
    c1_length, c2_length = len(coords1), len(coords2)
    out = np.empty(shape=(c1_length, c2_length), dtype=np.float64)

    # fill the lower triangle with euclidean distance formula
    # assuming coordiantes are (lat, lon) based on the example https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
    for lat_ix in prange(c1_length):
        for lon_ix in prange(c2_length):
            if lat_ix >= lon_ix: # do the reverse for the upper triangle
                out[lat_ix, lon_ix] = (
                    (coords1[lat_ix, 0] - coords2[lon_ix, 0]) ** 2
                    + (coords1[lat_ix, 1] - coords2[lon_ix, 1]) ** 2
                ) ** 0.5
            else:
                out[lat_ix, lon_ix] = 0
    return out


for n in [10, 100, 5000, 20000]:
    arr = np.random.normal(0, 100, (n, 2))
    print(n, arr.shape)

    %time out = euclidean_distance(arr, arr)
    %time out_cdist = distance.cdist(arr, arr, 'euclidean')

    if n < 1000:
        np.testing.assert_array_almost_equal(out, np.tril(out_cdist))
    print()

Output:

10 (10, 2)
CPU times: user 987 ms, sys: 19.3 ms, total: 1.01 s
Wall time: 1.01 s
CPU times: user 79 µs, sys: 12 µs, total: 91 µs
Wall time: 95.1 µs

100 (100, 2)
CPU times: user 1.05 ms, sys: 404 µs, total: 1.45 ms
Wall time: 1.16 ms
CPU times: user 926 µs, sys: 254 µs, total: 1.18 ms
Wall time: 946 µs

5000 (5000, 2)
CPU times: user 125 ms, sys: 128 ms, total: 253 ms
Wall time: 75 ms
CPU times: user 184 ms, sys: 92.6 ms, total: 277 ms
Wall time: 287 ms

20000 (20000, 2)
CPU times: user 2.21 s, sys: 2.15 s, total: 4.36 s
Wall time: 2.55 s
CPU times: user 3.1 s, sys: 2.71 s, total: 5.81 s
Wall time: 31.9 s

With a 20,000 element array, the UDF is quite a bit faster since it can save half of the computation. cdist seems particularly/unexpectedly slow for this specific distribution of data at scale on my Macbook Air, but the point is made regardless.

Efficiently Calculating a Euclidean Dist Matrix in Numpy?

3 Answers

The memory bandwidth is the limiting part in this problem