cluster analysis - Clustering 500,000 geospatial points in python

Question

I'm currently faced with the problem of finding a way to cluster around 500,000 latitude/longitude pairs in python. So far I've tried computing a distance matrix with numpy (to pass into the scikit-learn DBSCAN) but with such a large input it quickly spits out a Memory Error.

The points are stored in tuples containing the latitude, longitude, and the data value at that point.

In short, what is the most efficient way to spatially cluster a large number of latitude/longitude pairs in python? For this application, I'm willing to sacrifice some accuracy in the name of speed.

Edit: The number of clusters for the algorithm to find is unknown ahead of time.

Quadtree? There is a library PyQuadTree that exists if you'd like to try that route. — Cory Kramer
maybe I'm displaying ignorance, but was is your criteria for # of clusters? (And snide comment, if you set number of clusters to 1, there's an O(1) algorithm to do it) — Foon
The K means algorithm requires you to give it the number of clusters to find as a parameter of the algorithm - as opposed to DBSCAN or OPTICS which determine the clusters based on other criteria. Also, although O(1) is almost worth it, unfortunately there is usually more than a single cloud system between the north and the south pole haha. — user3681226

gobrewers14 gobrewers14 · Accepted Answer · 2014-06-03T21:55:05

I don't have your data so I just generated 500k random numbers into three columns.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans2, whiten

arr = np.random.randn(500000*3).reshape((500000, 3))
x, y = kmeans2(whiten(arr), 7, iter = 20)  #<--- I randomly picked 7 clusters
plt.scatter(arr[:,0], arr[:,1], c=y, alpha=0.33333);

out[1]:

enter image description here

I timed this and it took 1.96 seconds to run this Kmeans2 so I don't think it has to do with the size of your data. Put your data in a 500000 x 3 numpy array and try kmeans2.

cluster analysis - Clustering 500,000 geospatial points in python

2 Answers