8
votes

I've got a list of ~5000 points (specified as longitude/latitude pairs), and I want to find the nearest 5 of these to another point, specified by the user.

Can anyone suggest an efficient algorithm for working this out? I'm implementing this in Ruby, so if there's a suitable library then that would be good to know, but I'm still interested in the algorithm!

UPDATE: A couple of people have asked for more specific details on the problem. So here goes:

  • The 5000 points are mostly within the same city. There might be a few outside it, but it's safe to assume that 99% of them lie within a 75km radius, and that all of them lie within a 200km radius.
  • The list of points changes rarely. For the sake of argument, let's say it gets updated once per day, and we have to deal with a few thousand requests in that time.
7
If it is that few points it is ok to go one by one.Andrey
Regardless of which algorithm you choose, you can save some time by comparing squared distances instead of actual distances. No need to perform square root operations if you don't need to know the actual distances.Lars Haugseth

7 Answers

5
votes

You could accelerate the search by partitioning the 2D space with a quad-tree or a kd-tree and then once you've reach a leaf node you compare the remaining distances one by one until you find the closest match.

See also this blog post which refers to this other blog post which both discuss nearest neighbors searches with kd-trees in Ruby.

3
votes

You can get a very fast upper-bound estimator on distance using Manhattan distance (scaled for latitude), this should be good enough for rejecting 99.9% of candidates if they're not close (EDIT: since then you tell us they are close. In that case, your metric should be distance-squared, as per Lars H comment). Consider this equivalent to rejecting anything outside a spherical-rectangle bounding-box (as an approximation to a circle bounding-box). I don't do Ruby so here is algorithm with pseudocode:

Let the latitude, longitude of your reference point P (pa,po) and the other point X (xa,xo). Precompute ka, the latitude scaling factor for longitudinal distances: ka (= cos(pa in°)). (Strictly, ka = constant is a linearized approximation in the vicinity of P.)

Then the distance estimator is: D(X,P) = ka*|xa-pa| + |xo-po| = ka*da + do

where |z| means abs(z). At worst this overestimates true distance by a factor of √2 (when da==do), hence we allow for that as follows:

Do a running search and keep Dmin, the fifth-smallest scaled-Manhattan-distance-estimate. Hence you can reject upfront all points for which D(X,P) > √2 * Dmin (since they must be at least farther away than √((ka*da)² + do²) - that should eliminate 99.9% of points). Keep a list of all remaining candidate points with D(X,P) <= √2 * Dmin. Update Dmin if you found a new fifth-smallest D. Priority-queue, or else a list of (coord,D) are good data structures. Note that we never computed Euclidean distance, we only used float multiplication and addition.

(Consider this similar to quadtree except filtering out everything except the region that interests us, hence no need to compute accurate distances upfront or build the data structure.)

It would help if you tell us the expected spread in latitudes, longitudes (degrees, minutes or what? If all the points are close, the √2 factor in this estimator will be too conservative and mark every point as a candidate; a lookup-table based distance estimator would be preferable.)

Pseudocode:

initialize Dmin with the fifth-smallest D from the first five points in list
for point X in list:
    if D(X,P) <= √2 * Dmin:
        insert the tuple (X,D) in the priority-queue of candidates
        if (Dmin>D): Dmin = D
# after first pass, reject candidates with D > √2 * Dmin (use the final value of Dmin)
# ...
# then a second pass on candidates to find lowest 5 exact distances
2
votes

Since your list is quite short, I'd highly recommend brute force. Just compare all 5000 to the user-specified point. It'll be O(n) and you'll get paid.

Other than that, a quad-tree or Kd-tree are the usual approaches to spacial subdivision. But in your case, you'll end up doing a linear number of insertions into the tree, and then a constant number of logarithmic lookups... a bit of a waste, when you're probably better off just doing a linear number of distance comparisons and being done with it.

Now, if you want to find the N nearest points, you're looking at sorting on the computed distances and taking the first N, but that's still O(n log n)ish.

EDIT: It's worth noting that building the spacial tree becomes worthwhile if you're going to reuse the list of points for multiple queries.

1
votes

Rather than pure brute-force, for 5000 nodes, I would calculate the individual x+y distances for every node, rather than the straight line distance.

Once you've sorted that list, if e.g. x+y for the 5th node is 38, you can rule out any node where either x or y distance is > 38. This way, you can rule out a lot of nodes without having to calculate the straight line distance. Then brute force calculate the straight line distance for the remaining nodes.

1
votes

These algorithms are not easily explained, thus I will only give you some hints in the right direction. You should look for Voronoi Diagrams. With a Voronoi Diagram you can easily precompute a graph in O(n^2 log n) time and search the closest point in O(log n) time.

Precomputation is done with a cron job at night and searching is live. This corresponds to your specification.

Now you could save the k closests pairs of each of your 5000 points and then starting from the nearest point from the Voronoi Diagram and search the remaining 4 points.

But be warned that these algorithms are not very easy to implement.

A good reference is:

  • de Berg: Computational Geometry Algorithms Applications (2008) chapters 7.1 and 7.2
0
votes

Since you have that few points, I would recommend doing a brute-force search, to the effect of trying all points against each other with is an O(n^2) operation, with n = 5000, or roughly 25/2 million iterations of a suitable algorithm, and just storing the relevant results. This would have sub 100 ms execution time in C, so we are looking at a second or two at the most in Ruby.

When the user picks a point, you can use your stored data to give the results in constant time.

EDIT I re-read your question, and it seems as though the user provides his own last point. In that case it's faster to just do a O(n) linear search through your set each time user provides a point.

0
votes

if you need to repeat this multiple times, with different user-entered locations, but don't want to implement a quad-tree (or can't find a library implementation) then you can use a locality-sensitive hashing (kind-of) approach that's fairly intuitive:

  • take your (x,y) pairs and create two lists, one of (x, i) and one of (y, i) where i is the index of the point
  • sort both lists

then, when given a point (X, Y),

  • bisection sort for X and Y
  • expand outwards on both lists, looking for common indices
  • for common indices, calculate exact distances
  • stop expanding when the differences in X and Y exceed the exact distance of the most-distant of the current 5 points.

all you're doing is saying that a nearby point must have a similar x and a similar y value...