3
votes

I am trying to implement a Nearest neighbour search for Lat and Lon data. Here is the Data.txt

61.3000183105 -21.2500038147 0
62.299987793 -23.750005722 1
66.3000488281 -28.7500038147 2
40.8000183105 -18.250005722 3
71.8000183105 -35.7500038147 3
39.3000183105 -19.7500019073 4
39.8000183105 -20.7500038147 5
41.3000183105 -20.7500038147 6

The problem is, when I want to do the nearest neighbour for each of the Lat and Lon on the data set, it is searching it self. e.g Nearest Neighbour of (-21.2500038147,61.3000183105) will be (-21.2500038147,61.3000183105) and the resulting distance will be 0.0. I am trying to avoid this but with no luck. I tried doing if not (array_equal) but still...

Below is my python code

import numpy as np
from numpy import *
import decimal
from scipy import spatial
from scipy.spatial import KDTree
from math import radians,cos,sin,sqrt,exp


Lat =[]
Lon =[]
Day =[]

nja = []


Data = np.loadtxt('Data.txt',delimiter=" ")
for i in range(0,len(Data)):
    Lon.append(Data[i][:][0])
    Lat.append(Data[i][:][1])
    Day.append(Data[i][:][2])   

tree =spatial.KDTree(zip(Lon,Lat) )

print "Lon  :",len(Lon)
print "Tree :",len(tree.data)

for i in range(0,len(tree.data)):
    pts = np.array([tree.data[i][0],tree.data[i][1]])
    nja.append(pts)

for i in range(0, len(nja)):
    if not (np.array_equal(nja,tree.data)):
    nearest = tree.query(pts,k=1,distance_upper_bound =9)
    print nearest
2

2 Answers

1
votes

For each point P[i] in your data set, you're asking "Which is the point nearest to P[i] in my data set?" and you get the answer "It is P[i]".

If you ask a different question, "Which are the TWO points nearest to P[i]?", i.e., tree.query(pts,k=2) (the difference with your code being s/k=1/k=2/) you will get P[i] and also a P[j], the second nearest point, that is the result you want.

Side note:

  • I'd recommend that you project your data before building the tree, cause in your range of latitudes there is a large fluctuation in what is meant by a 1 degree distance in longitude.
-1
votes

How'bout a low-tech solution? If you have a large number of points (say 10000 or more), this is no more reasonable, but for a smaller number this brute force solution might be useful:

 import numpy as np

 dist = (Lat[:,None]-Lat[None,:])**2 + (Lon[:,None]-Lon[None,:])**2

Now you have an NxN array (N is the number of points) with distances (or squares of distances, to be more precise) between all point pairs. Finding the shortest distance for each point is then a matter of finding the smallest value on each row. To exclude the point itself you may set the diagonal to NaN and use nanargmax:

np.fill_diagonal(dist, np.nan)
closest = np.nanargmin(dist, axis=1)

This approach is very simple and guaranteed to find the closest points, but has two significant downsides:

  1. It is O(n^2), and at 10000 points it takes around one second
  2. Ot consumes a lot of memory (800 MB for the above mentioned case)

The latter problem can of course be avoided by doing this piecewise, but the first problem excludes large point sets.


This can be carried out also by using scipy.spatial.distance.pdist:

dist=scipy.spatial.distance.pdist(np.column_stack((Lon, Lat)))

This is a bit faster (by half at least), but the output matrix is in the condensed form, see the documentation for scipy.spatial.distance.squareform.

If you need to calculate the real distances, then this is a good alternative, as pdist can handle distances on a sphere.


Then, again, you may use your KDtree approach by just extending your query to two closest point:

nearest = tree.query(pts, k=2, distance_upper_bound=9)

Then nearest[1][0] has the point itself ("me, myself, and I"), nearest[1][1] the real nearest neighbour (or inf if there is nothing near enough).

The best solution depends on the number of points you have. Also, you might want to use something else than cartesian 2D distances if your map points are not close to each other on the globe.


Just a note about using latitudes and longitudes in finding distances: If you just try to pretend they are 2D Cartesian points, you get it wrong. At 60°N one degree of latitude is 1111 km, whereas one degree of longitude is 555 km. So, at least you will have to divide the longitudes by cos(latitude). And even with that trick you will end up in trouble when the longitudes change from east to west.

Probably the easiest way out of this trouble is to calculate the coordinate points into Cartesian 3D points:

x = cos(lat) * cos(lon)
y = cos(lat) * sin(lon)
z = sin(lat)

If you then calculate the shortest distances between these points, you will get the correct results. (Just note that the distances are not the same as real shortest distances on the surface of the globe.)