0
votes

This is an algorithm question.

Given 1 million points, each with x and y coordinates that are floating-point numbers.

Find the 10 closest points to a given point as fast as possible.

Closeness can be measured as Euclidean distance on a plane or another kind of distance on a globe. I prefer a binary-search-style approach because of the large number of points.

My idea:

Save the points in a database.

1. Amplify x by a large integer, e.g. 10^4, cut off the decimal part, and then amplify the resulting integer part by 10^4 again.

2. Amplify y by a smaller integer, e.g. 10^2, and cut off the decimal part.

3. Sum the results of steps 1 and 2; call this sum the associate_value.

4. Repeat steps 1 to 3 for each point in the database.

E.g.

    x = 12.3456789, y = 98.7654321
    x times 10^4 = 123456 (decimal part cut off), then times 10^4 again to get 1234560000
    y times 10^2 = 9876.54321, cut off the decimal part to get 9876
    Sum them: 1234560000 + 9876 = 1234569876

In this way, I transform the 2-d data into 1-d data. In the database, each point is associated with an integer (its associate_value). The integer column can be indexed in the database for fast search.
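For concreteness, here is a minimal Python sketch of this encoding, assuming coordinates in the range [0, 100) and the scale factors from the example above; the function name associate_value is just for illustration:

    def associate_value(x, y):
        # x: keep 4 decimal digits, then shift left by 4 decimal places
        x_part = int(x * 10**4)          # 12.3456789 -> 123456
        # y: keep 2 decimal digits, so the y part stays below 10^4
        y_part = int(y * 10**2)          # 98.7654321 -> 9876
        return x_part * 10**4 + y_part   # 1234560000 + 9876 = 1234569876

    print(associate_value(12.3456789, 98.7654321))  # 1234569876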

For a given point (x, y), I perform steps 1-3 on it and then find the points in the database whose associate_value is close to the given point's associate_value.

E.g.

    x = 59.469797, y = 96.4976416
    its associate_value is 5946979649

Then, in the database, I search for associate_values that are close to 5946979649, for example between 5946979649 - 50 and 5946979649 + 50, or between 5946979649 - 50000000 and 5946979649 + 50000000. This can be done with an index search in the database.

In this way, I can find a group of points that are close to the given point and greatly reduce the search space. Then I can use the Euclidean (or another) distance formula on that group to find the closest points.
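To see how the whole lookup would behave, here is a rough sketch that uses a sorted Python list and bisect as a stand-in for the database index; the random point data, the window size and the function names are made up for illustration:

    import bisect
    import math
    import random

    def associate_value(x, y):
        return int(x * 10**4) * 10**4 + int(y * 10**2)

    # made-up data: 1 million points with coordinates in [0, 100)
    points = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(1_000_000)]

    # the sorted list plays the role of the database index on associate_value
    index = sorted((associate_value(x, y), x, y) for x, y in points)
    keys = [k for k, _, _ in index]

    def ten_closest(qx, qy, window=1_000_000, k=10):
        # range search on the 1-d key; a window of 10^6 corresponds to
        # roughly +/- 0.01 in x and is purely a tuning knob
        q = associate_value(qx, qy)
        lo = bisect.bisect_left(keys, q - window)
        hi = bisect.bisect_right(keys, q + window)
        candidates = index[lo:hi]
        # re-rank the candidates by true Euclidean distance
        candidates.sort(key=lambda t: math.hypot(t[1] - qx, t[2] - qy))
        return [(x, y) for _, x, y in candidates[:k]]

    print(ten_closest(59.469797, 96.4976416))

Note that because the key is dominated by x, the window effectively bounds only the x difference: candidates can still be far away in y, and genuinely close points whose x difference exceeds the window are missed entirely.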

I am not sure about the efficiency of this algorithm, especially the process of generating the associate_values.

Does my idea work or not? Any better ideas?

Thanks

It seems like you have to go through each point multiple times; you might as well just calculate the distance between the specified point and every other point once, and keep the top 10 points in a list. – Moop
Are they always going to be the same million points, or will they be random each time you want to search? – Moop
@Moop, the 1 million points are fixed. Thanks. – user3601704
I would look into R-trees if you are interested: en.wikipedia.org/wiki/R-tree – Moop
The quickest way is just to have each point object reference its nearest 10 other points. Obviously that will take a lot of memory (10x) and probably take a while to generate at first, but the search time would then be O(1). – Moop

3 Answers

0
votes

Your idea seems like it may work, but I would be concerned about degenerate cases (like when no points fall within your specified ranges, though maybe that is not possible given the constraints). Either way, since you asked for other ideas, here is my stab at it: store all of your points in a quad tree, then just walk down the quad tree until you have a sufficiently small group to search through. Since the points are fixed, building the quad tree is a one-time cost, and each search should be logarithmic in the number of points you have.
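To make this concrete, here is a minimal point-quadtree sketch of the idea, assuming coordinates in [0, 100); the class name and leaf capacity are arbitrary. Walking down to the query's leaf and brute-forcing it is the "sufficiently small group" step, with the caveat that a leaf may hold fewer than 10 points and that neighbours just across a cell boundary are missed unless adjacent cells are also checked:

    import math

    class QuadNode:
        CAPACITY = 64  # max points per leaf before splitting (arbitrary)

        def __init__(self, x0, y0, x1, y1):
            self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1
            self.points = []
            self.children = None

        def insert(self, p):
            if self.children is not None:
                self._child_for(p).insert(p)
                return
            self.points.append(p)
            if len(self.points) > self.CAPACITY:
                self._split()

        def _split(self):
            mx, my = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
            self.children = [
                QuadNode(self.x0, self.y0, mx, my), QuadNode(mx, self.y0, self.x1, my),
                QuadNode(self.x0, my, mx, self.y1), QuadNode(mx, my, self.x1, self.y1),
            ]
            for p in self.points:
                self._child_for(p).insert(p)
            self.points = []

        def _child_for(self, p):
            mx, my = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
            return self.children[(1 if p[0] >= mx else 0) + (2 if p[1] >= my else 0)]

        def leaf_for(self, p):
            return self if self.children is None else self._child_for(p).leaf_for(p)

    def ten_closest(root, q):
        # search only the leaf cell containing the query point
        group = root.leaf_for(q).points
        return sorted(group, key=lambda p: math.hypot(p[0] - q[0], p[1] - q[1]))[:10]

    # usage: root = QuadNode(0, 0, 100, 100); insert every point once; then query
    # root.insert((12.3456789, 98.7654321)); ten_closest(root, (59.47, 96.50))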

0
votes

You can do better by interleaving the bits of the x and y coordinates instead of concatenating the scaled values. Rather than ordering the points along a straight line, this orders them along a z-curve, and you can compute upper bounds from the most significant bits. The z-curve is often used in mapping applications: http://msdn.microsoft.com/en-us/library/bb259689.aspx
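For reference, a small sketch of the bit interleaving that produces the z-order (Morton) key, assuming the coordinates are first quantised to 16-bit integers; the scale factor and names are made up:

    def interleave_bits(xi, yi):
        # spread the bits: x bits go to even positions, y bits to odd positions,
        # so points that are close in both x and y get nearby keys
        z = 0
        for i in range(16):
            z |= ((xi >> i) & 1) << (2 * i)
            z |= ((yi >> i) & 1) << (2 * i + 1)
        return z

    def morton_key(x, y, scale=65536 / 100):
        # quantise coordinates in [0, 100) onto a 2^16 grid, then interleave
        return interleave_bits(int(x * scale), int(y * scale))

    print(morton_key(59.469797, 96.4976416))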

-1
votes

The way I read your algorithm, you are selecting values along a line with a slope of -1 that are similar to your point, i.e. if your point is (2, 2) you would look at points (1, 3), (0, 4) and (-1, 5) and likely miss closer points. Most algorithms that solve this are O(n), which isn't terribly bad.

A simple algorithm for this problem is to keep a priority queue of the closest ten points, plus the distance of the furthest of those ten, as you iterate over the set. If a point's x or y difference alone already exceeds that furthest distance, discard it immediately. Otherwise compute its distance with whatever distance measure you're using and see whether it gets inserted into the queue. If it does, update your top-ten distance threshold and continue iterating.
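A sketch of that linear scan, using Python's heapq as a max-heap (negated distances) for the current ten closest, with the per-axis rejection test described above; names are made up:

    import heapq
    import math

    def ten_closest(points, qx, qy, k=10):
        heap = []  # max-heap via negated distances: (-dist, x, y)
        for x, y in points:
            if len(heap) == k:
                threshold = -heap[0][0]  # distance of the current tenth-closest
                # cheap rejection: if one axis already exceeds the threshold,
                # the Euclidean distance must as well
                if abs(x - qx) > threshold or abs(y - qy) > threshold:
                    continue
            d = math.hypot(x - qx, y - qy)
            if len(heap) < k:
                heapq.heappush(heap, (-d, x, y))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, x, y))
        return sorted((-nd, x, y) for nd, x, y in heap)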

If your points are pre-sorted on one of the axes, you can further optimize the algorithm by starting at the query's position on that axis and radiating outward until the difference on that axis alone is greater than the distance to your tenth-closest point. I did not include sorting in the description above because sorting is O(n log n), which is slower than O(n); but if you are doing this multiple times on the same set, it could be beneficial to sort it once.
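And a sketch of the pre-sorted variant, assuming the points have been sorted by x once up front: start at the query's position in that order and radiate outward, giving up on a side once its x gap alone exceeds the current tenth-closest distance:

    import bisect
    import heapq
    import math

    def ten_closest_sorted(points_by_x, qx, qy, k=10):
        xs = [p[0] for p in points_by_x]
        start = bisect.bisect_left(xs, qx)
        heap = []  # max-heap via negated distances

        def consider(i):
            x, y = points_by_x[i]
            d = math.hypot(x - qx, y - qy)
            if len(heap) < k:
                heapq.heappush(heap, (-d, x, y))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, x, y))

        left, right = start - 1, start
        while left >= 0 or right < len(points_by_x):
            threshold = -heap[0][0] if len(heap) == k else math.inf
            go_left = left >= 0 and qx - xs[left] <= threshold
            go_right = right < len(points_by_x) and xs[right] - qx <= threshold
            if not go_left and not go_right and len(heap) == k:
                break  # no remaining point can beat the tenth-closest
            if go_left:
                consider(left)
            if go_right:
                consider(right)
            left -= 1
            right += 1
        return sorted((-nd, x, y) for nd, x, y in heap)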