I am designing an agglomerative (bottom-up) clustering algorithm for millions of points of 50 to 1000 dimensions. In two parts of my algorithm, I need to compare two clusters of points and decide the separation between them. The exact distance is the minimum Euclidean distance taken over all pairs of points P1-P2, where P1 is taken from cluster C1 and P2 from cluster C2. If C1 has X points and C2 has Y points, this requires X*Y distance measurements.
I currently estimate this distance in a way that requires X+Y measurements:
- Find the centroid Ctr1 of cluster C1.
- Find the point P2 in cluster C2 that is closest to Ctr1. (Y distance measurements.)
- Find the point P1 in C1 that is closest to P2. (X distance measurements.)
- The distance from P1 to P2 is an approximate measure of the distance between clusters C1 and C2. It is an upper bound on the true value.
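For concreteness, here is a rough sketch of what I mean, with the exact O(X*Y) version shown for comparison (the class and method names are mine, purely for illustration):

```csharp
using System;

static class ClusterGap
{
    // Squared Euclidean distance between two points.
    static double Dist2(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    static double[] Centroid(double[][] cluster)
    {
        var c = new double[cluster[0].Length];
        foreach (var p in cluster)
            for (int i = 0; i < c.Length; i++)
                c[i] += p[i];
        for (int i = 0; i < c.Length; i++)
            c[i] /= cluster.Length;
        return c;
    }

    static double[] Nearest(double[][] cluster, double[] target)
    {
        double[] best = cluster[0];
        double bestD = Dist2(best, target);
        foreach (var p in cluster)
        {
            double d = Dist2(p, target);
            if (d < bestD) { bestD = d; best = p; }
        }
        return best;
    }

    // Exact minimum inter-cluster distance: X * Y measurements.
    public static double ExactMinDistance(double[][] c1, double[][] c2)
    {
        double best = double.PositiveInfinity;
        foreach (var p1 in c1)
            foreach (var p2 in c2)
                best = Math.Min(best, Dist2(p1, p2));
        return Math.Sqrt(best);
    }

    // My estimate: X + Y measurements, always an upper bound on the exact value.
    public static double ApproxMinDistance(double[][] c1, double[][] c2)
    {
        double[] ctr1 = Centroid(c1);     // no distance measurements
        double[] p2 = Nearest(c2, ctr1);  // Y measurements
        double[] p1 = Nearest(c1, p2);    // X measurements
        return Math.Sqrt(Dist2(p1, p2));
    }
}
```

Note that the estimate is asymmetric: calling ApproxMinDistance(c2, c1) starts from C2's centroid and can give a different (sometimes tighter) bound, so taking the minimum of both directions costs 2(X+Y) measurements.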
If clusters are roughly spherical, this estimate works very well, and my test data (ellipsoidal Gaussian clusters) suits it. However, if clusters have weird, folded, bendy shapes, it may yield poor results. My questions are:
Is there an algorithm that uses even fewer than X+Y distance measurements and in the average case yields good accuracy?
OR
Is there an algorithm that (like mine) uses X+Y distance measurements but delivers better accuracy than mine?
(I am programming this in C#, but a description of an algorithm in pseudo-code or any other language is fine. Please avoid references to specialized library functions from R or Matlab. An algorithm with probabilistic guarantees like "95% chance that the distance is within 5% of the minimum value" is acceptable.)
NOTE: I just found this related question, which discusses a similar problem, though not necessarily for high dimensions: "Given two (large) sets of points, how can I efficiently find pairs that are nearest to each other?"
NOTE: I just discovered that this is called the bichromatic closest-pair problem.
For context, here is an overview of the overall clustering algorithm:
The first pass consolidates the densest regions into small clusters using a space-filling curve (Hilbert Curve). It misses outliers and often fails to merge adjacent clusters that are very close to one another. However, it does discover a characteristic maximum linkage-distance. All points separated by less than this characteristic distance must be clustered together. This step has no predefined number of clusters as its goal.
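To show the shape of this pass, here is a drastically simplified sketch. I substitute a Morton (Z-order) index as a stand-in for the Hilbert index, assume coordinates are pre-quantized to integers with a known bit depth, and omit how the characteristic distance is discovered:

```csharp
using System;
using System.Linq;
using System.Numerics;

static class Pass1
{
    // Morton (Z-order) index, used here as a simpler stand-in for the
    // Hilbert index; coordinates are assumed pre-quantized to 'bits' bits.
    static BigInteger MortonIndex(int[] q, int bits)
    {
        BigInteger key = 0;
        for (int b = bits - 1; b >= 0; b--)       // most significant bit first
            for (int d = 0; d < q.Length; d++)
                key = (key << 1) | ((q[d] >> b) & 1);
        return key;
    }

    static double Distance(int[] a, int[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            double diff = (double)a[i] - b[i];
            sum += diff * diff;
        }
        return Math.Sqrt(sum);
    }

    // Sort points along the curve, then cut the sorted sequence wherever
    // consecutive points are farther apart than maxLinkage. (Discovery of
    // the characteristic distance is omitted here.)
    public static int[] Cluster(int[][] points, int bits, double maxLinkage)
    {
        int[] order = Enumerable.Range(0, points.Length)
            .OrderBy(i => MortonIndex(points[i], bits))
            .ToArray();
        var label = new int[points.Length];
        int cluster = 0;
        label[order[0]] = cluster;
        for (int k = 1; k < order.Length; k++)
        {
            if (Distance(points[order[k - 1]], points[order[k]]) > maxLinkage)
                cluster++;                         // gap: start a new cluster
            label[order[k]] = cluster;
        }
        return label;
    }
}
```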
The second pass performs single-linkage agglomeration, combining clusters whenever their minimum distance is less than the maximum linkage-distance. This is not hierarchical clustering; it is partition-based: every pair of clusters closer than that threshold ends up combined. This step has no predefined number of clusters as its goal.
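In sketch form, this pass is a union-find over cluster ids. The distance function is a parameter and could be the X+Y estimate sketched earlier; the all-pairs loop is illustrative only, not how candidate pairs would be enumerated at scale:

```csharp
using System;

static class Pass2
{
    // Union-find with path halving over cluster ids.
    static int Find(int[] parent, int i)
    {
        while (parent[i] != i) { parent[i] = parent[parent[i]]; i = parent[i]; }
        return i;
    }

    // Merge every pair of clusters whose (approximate) minimum distance
    // is below maxLinkage.
    public static int[] MergeUnderThreshold(
        int clusterCount, Func<int, int, double> distance, double maxLinkage)
    {
        var parent = new int[clusterCount];
        for (int i = 0; i < clusterCount; i++) parent[i] = i;

        for (int a = 0; a < clusterCount; a++)
            for (int b = a + 1; b < clusterCount; b++)
                if (distance(a, b) < maxLinkage)
                {
                    int ra = Find(parent, a), rb = Find(parent, b);
                    if (ra != rb) parent[rb] = ra;   // single-linkage union
                }

        for (int i = 0; i < clusterCount; i++) parent[i] = Find(parent, i);
        return parent;   // parent[i] = representative id of cluster i
    }
}
```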
The third pass performs additional single-linkage agglomeration, sorting all inter-cluster distances and combining clusters only until the number of clusters equals a predefined target. It handles some outliers, preferring to merge outliers only with large clusters. If there are many outliers (and there usually are), this may fail to reduce the number of clusters to the target.
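This pass is essentially a Kruskal-style merge of the cheapest inter-cluster edges first, stopping at the target count; a sketch follows (the preference for merging outliers into large clusters is omitted here):

```csharp
using System.Collections.Generic;
using System.Linq;

static class Pass3
{
    public record struct Edge(int A, int B, double Distance);

    static int Find(int[] parent, int i)
    {
        while (parent[i] != i) { parent[i] = parent[parent[i]]; i = parent[i]; }
        return i;
    }

    // Merge the cheapest inter-cluster edges first and stop once only
    // 'target' clusters remain (or the edges run out).
    public static int[] MergeToTarget(int clusterCount, List<Edge> edges, int target)
    {
        var parent = Enumerable.Range(0, clusterCount).ToArray();
        int remaining = clusterCount;
        foreach (var edge in edges.OrderBy(e => e.Distance))
        {
            if (remaining <= target) break;
            int ra = Find(parent, edge.A), rb = Find(parent, edge.B);
            if (ra != rb) { parent[rb] = ra; remaining--; }
        }
        for (int i = 0; i < clusterCount; i++) parent[i] = Find(parent, i);
        return parent;
    }
}
```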
The fourth pass combines all remaining outliers with the nearest large cluster, but never merges two large clusters with each other. (This prevents two adjacent clusters from accidentally being merged because their outliers form a thin chain between them.)
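A sketch of this last pass, assuming a simple size cutoff separates "large" clusters from outliers (minLargeSize and the distance delegate are placeholders for the real criteria):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Pass4
{
    // Attach every outlier cluster to its nearest large cluster; large
    // clusters never merge with each other.
    public static Dictionary<int, int> AbsorbOutliers(
        Dictionary<int, int> clusterSizes,   // cluster id -> point count
        Func<int, int, double> distance,     // inter-cluster distance
        int minLargeSize)
    {
        var large = clusterSizes.Where(kv => kv.Value >= minLargeSize)
                                .Select(kv => kv.Key)
                                .ToList();
        var target = new Dictionary<int, int>();
        foreach (var kv in clusterSizes)
        {
            if (kv.Value >= minLargeSize)
                target[kv.Key] = kv.Key;     // large clusters stay as they are
            else
                target[kv.Key] = large.OrderBy(g => distance(kv.Key, g)).First();
        }
        return target;
    }
}
```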