8 votes

I am designing an agglomerative, bottom-up clustering algorithm for millions of 50-1000 dimensional points. In two parts of my algorithm, I need to compare two clusters of points and decide the separation between the two clusters. The exact distance is the minimum Euclidean distance taken over all pairs of points P1-P2 where P1 is taken from cluster C1 and P2 is taken from cluster C2. If C1 has X points and C2 has Y points then this requires X*Y distance measurements.
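
For reference, the exact computation is just a double loop over the two clusters. Here is a minimal C# sketch (representing points as double[] arrays is my own assumption):

    using System;
    using System.Collections.Generic;

    public static class ClusterDistance
    {
        // Exact minimum inter-cluster distance: X * Y distance measurements.
        public static double ExactMinimumDistance(IList<double[]> c1, IList<double[]> c2)
        {
            double best = double.MaxValue;
            foreach (var p1 in c1)
                foreach (var p2 in c2)
                    best = Math.Min(best, Distance(p1, p2));
            return best;
        }

        // Euclidean distance between two points of equal dimensionality.
        public static double Distance(double[] a, double[] b)
        {
            double sum = 0;
            for (int d = 0; d < a.Length; d++)
            {
                double diff = a[d] - b[d];
                sum += diff * diff;
            }
            return Math.Sqrt(sum);
        }
    }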

I currently estimate this distance in a way that requires X+Y measurements (a code sketch follows the list):

  1. Find the centroid Ctr1 of cluster C1.
  2. Find the point P2 in cluster C2 that is closest to Ctr1. (Y comparisons.)
  3. Find the point P1 in C1 that is closest to P2. (X comparisons.)
  4. The distance from P1 to P2 is an approximate measure of the distance between the clusters C1 and C2. It is an upper-bound on the true value.
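
In code, these four steps might look roughly like the following sketch, which continues the ClusterDistance class from the previous snippet:

    // Approximate minimum distance using X + Y distance measurements.
    // The result is an upper bound on the true minimum distance.
    public static double ApproximateMinimumDistance(IList<double[]> c1, IList<double[]> c2)
    {
        int dims = c1[0].Length;

        // 1. Find the centroid Ctr1 of cluster C1.
        var centroid = new double[dims];
        foreach (var p in c1)
            for (int d = 0; d < dims; d++)
                centroid[d] += p[d];
        for (int d = 0; d < dims; d++)
            centroid[d] /= c1.Count;

        // 2. Find the point P2 in C2 closest to Ctr1 (Y measurements).
        double[] p2 = null;
        double bestToCentroid = double.MaxValue;
        foreach (var q in c2)
        {
            double dist = Distance(centroid, q);
            if (dist < bestToCentroid) { bestToCentroid = dist; p2 = q; }
        }

        // 3. Find the point P1 in C1 closest to P2 (X measurements).
        double approx = double.MaxValue;
        foreach (var p in c1)
            approx = Math.Min(approx, Distance(p, p2));

        // 4. The P1-P2 distance approximates the cluster separation (upper bound).
        return approx;
    }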

If clusters are roughly spherical, this works very well, and my test data (ellipsoidal Gaussian clusters) fits that assumption. However, if clusters have weird, folded, bendy shapes, it may yield poor results. My questions are:

Is there an algorithm that uses even fewer than X+Y distance measurements and in the average case yields good accuracy?

OR

Is there an algorithm that (like mine) uses X+Y distance measurements but delivers better accuracy than mine?

(I am programming this in C#, but a description of an algorithm in pseudo-code or any other language is fine. Please avoid references to specialized library functions from R or Matlab. An algorithm with probabilistic guarantees like "95% chance that the distance is within 5% of the minimum value" is acceptable.)

NOTE: I just found this related question, which discusses a similar problem, though not necessarily for high dimensions: "Given two (large) sets of points, how can I efficiently find pairs that are nearest to each other?"

NOTE: I just discovered that this is called the bichromatic closest-pair problem.

For context, here is an overview of the overall clustering algorithm (a skeleton sketch follows the list):

  1. The first pass consolidates the densest regions into small clusters using a space-filling curve (Hilbert Curve). It misses outliers and often fails to merge adjacent clusters that are very close to one another. However, it does discover a characteristic maximum linkage-distance. All points separated by less than this characteristic distance must be clustered together. This step has no predefined number of clusters as its goal.

  2. The second pass performs single-linkage agglomeration by combining clusters together if their minimum distance is less than the maximum linkage-distance. This is not hierarchical clustering; it is partition-based. All clusters whose minimum distance from one another is less than this maximum linkage-distance will be combined. This step has no predefined number of clusters as its goal.

  3. The third pass performs additional single-linkage agglomeration, sorting all inter-cluster distances and only combining clusters until the number of clusters equals a predefined target number of clusters. It handles some outliers, preferring to only merge outliers with large clusters. If there are many outliers (and there usually are), this may fail to reduce the number of clusters to the target.

  4. The fourth pass combines all remaining outliers with the nearest large cluster, but causes no large clusters to merge with other large clusters. (This prevents two adjacent clusters from accidentally being merged due to their outliers forming a thin chain between them.)
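
As a skeleton, the four passes fit together roughly as shown below. Every name here is a placeholder of mine and the pass bodies are omitted; it only illustrates the structure:

    using System;
    using System.Collections.Generic;

    // Placeholder skeleton of the four passes; only the structure is shown.
    public sealed class FourPassClusterer
    {
        public List<List<double[]>> Cluster(IList<double[]> points, int targetClusterCount)
        {
            // Pass 1: Hilbert-curve pass forms small, dense clusters and discovers
            // the characteristic maximum linkage-distance.
            var clusters = HilbertPass(points, out double maxLinkageDistance);

            // Pass 2: single-linkage agglomeration of all clusters whose minimum
            // distance is below the maximum linkage-distance (no target count).
            clusters = MergeWithinLinkageDistance(clusters, maxLinkageDistance);

            // Pass 3: sort inter-cluster distances and merge until the target
            // number of clusters is reached, preferring to merge outliers into
            // large clusters.
            clusters = MergeTowardTarget(clusters, targetClusterCount);

            // Pass 4: absorb remaining outliers into the nearest large cluster
            // without allowing two large clusters to merge through outlier chains.
            return AbsorbOutliers(clusters);
        }

        private List<List<double[]>> HilbertPass(IList<double[]> points, out double maxLinkageDistance)
            => throw new NotImplementedException();

        private List<List<double[]>> MergeWithinLinkageDistance(List<List<double[]>> clusters, double maxLinkageDistance)
            => throw new NotImplementedException();

        private List<List<double[]>> MergeTowardTarget(List<List<double[]>> clusters, int targetClusterCount)
            => throw new NotImplementedException();

        private List<List<double[]>> AbsorbOutliers(List<List<double[]>> clusters)
            => throw new NotImplementedException();
    }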

Did you try something like that? – Borbag
No! It helps to know the "name" of the problem! Thanks! I will read the article. – Paul Chernoch
Interesting algorithms in the article. However, the recursive divide-and-conquer algorithm's dependence on D (number of dimensions) is a problem, because for me, D is often greater than K (number of clusters). I will study it further. – Paul Chernoch
Could you try to reduce dimensionality, or are all the dimensions approximately equally important? – kfx
@kfx - In my real-world data, I identified redundant dimensions and reduced it from 40,000 to 950 dimensions. In subsequent processing, I weight zip codes by population, but in the clustering, all dimensions that are not redundant are considered equally important. – Paul Chernoch

2 Answers

0 votes

You could use an index. That is the classic solution.

A spatial index can help you find the nearest neighbor of any point in roughly O(log n) time. So if your clusters have n and m objects, index the larger cluster and query it with every point of the smaller one, to find the closest pair in O(n log m) or O(m log n).

A simpler heuristic approach is to iterate your idea multiple times, shrinking your set of candidates. So you find a good pair of objects a, b from the two clusters. Then you discard, from each cluster, every object that the triangle inequality proves cannot be part of a pair closer than that upper bound. Then you repeat this, choosing a different a, b each time. Once your candidate sets stop improving, do the pairwise comparisons on the remaining objects only. The worst case of this approach remains O(n*m).
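
For concreteness, one round of that pruning could look like the sketch below (the double[] point representation, the choice of anchor, and the HasValueWithin helper are my own assumptions, not a prescribed implementation):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class AnchorPruning
    {
        // One pruning round: a is a point of C1, b a point of C2, and d(a, b) is
        // the current upper bound on the minimum inter-cluster distance.
        public static double PrunedMinimumDistance(IList<double[]> c1, IList<double[]> c2,
                                                   double[] a, double[] b)
        {
            double upperBound = Distance(a, b);

            // Distance of every point to the anchor b: n + m measurements.
            double[] r1 = c1.Select(p => Distance(p, b)).ToArray();
            double[] r2 = c2.Select(q => Distance(q, b)).ToArray();

            // Reverse triangle inequality: d(p, q) >= |d(p, b) - d(q, b)|.
            // A point survives only if some point of the other cluster has an
            // anchor distance within +/- upperBound of its own.
            double[] sorted1 = (double[])r1.Clone(); Array.Sort(sorted1);
            double[] sorted2 = (double[])r2.Clone(); Array.Sort(sorted2);
            var survivors1 = c1.Where((p, i) => HasValueWithin(sorted2, r1[i], upperBound)).ToList();
            var survivors2 = c2.Where((q, j) => HasValueWithin(sorted1, r2[j], upperBound)).ToList();

            // Pairwise comparison over the survivors only (or repeat the pruning
            // with a different anchor pair if the candidate sets are still large).
            double best = upperBound;
            foreach (var p in survivors1)
                foreach (var q in survivors2)
                    best = Math.Min(best, Distance(p, q));
            return best;
        }

        // True if some value in the sorted array lies within [center - tol, center + tol].
        private static bool HasValueWithin(double[] sortedValues, double center, double tol)
        {
            int i = Array.BinarySearch(sortedValues, center - tol);
            if (i < 0) i = ~i;                      // index of first value >= center - tol
            return i < sortedValues.Length && sortedValues[i] <= center + tol;
        }

        private static double Distance(double[] a, double[] b)
        {
            double sum = 0;
            for (int d = 0; d < a.Length; d++) { double diff = a[d] - b[d]; sum += diff * diff; }
            return Math.Sqrt(sum);
        }
    }

This pruning is lossless: the reverse triangle inequality around the anchor b can never discard the true closest pair, so the comparison over the survivors returns the exact minimum; the savings depend on how many points the anchor test removes.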

0 votes

I have found a paper that describes a linear-time, randomized, epsilon-approximate algorithm for the bichromatic closest-pair problem:

http://www.cs.umd.edu/~samir/grant/cp.pdf

I will attempt to implement it and see if it works.

UPDATE - After further study, it is apparent that the runtime is proportional to 3^D, where D is the number of dimensions. This is unacceptable for my data. After trying several other approaches, I hit on the following (a code sketch of steps 3 through 7 appears after the list).

  1. Perform a rough clustering into K clusters using an efficient but incomplete method. This method will properly cluster some points, but yield too many clusters. These small clusters remain to be further consolidated to form larger clusters. This method will determine an upper bound distance DMAX between points that are considered to be in the same cluster.
  2. Sort the points in Hilbert curve order.
  3. Throw out all points immediately preceded and succeeded by a neighbor from the same cluster. More often than not, these are interior points of a cluster, not surface points.
  4. For each point P1, search forward, but no farther than the next point from the same cluster.
  5. Compute the distance from point P1 (in cluster C1) to each visited point P2 (in cluster C2), and record the distance if it is smaller than any prior distance measured between points in C1 and C2.
  6. However, if P1 has already been compared to a point in C2, do not do so again. Only make a single comparison between P1 and any point in C2.
  7. After all comparisons have been made, at most K(K-1) distances will have been recorded, and many of those are discarded because they are larger than DMAX. The remainder are the estimated closest-point distances.
  8. Perform merges between clusters if they are nearer than DMAX.
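
Here is a C# sketch of steps 3 through 7, assuming the points are already sorted in Hilbert-curve order (step 2) and carry the preliminary cluster ids from step 1; the LabelledPoint type and the tuple-keyed dictionary are my own choices:

    using System;
    using System.Collections.Generic;

    public static class HilbertClosestPairs
    {
        public sealed class LabelledPoint
        {
            public double[] Coordinates;
            public int ClusterId;
        }

        // Returns, for each pair of cluster ids, an estimated closest-point distance,
        // keeping only estimates below dMax (the merge candidates of step 8).
        public static Dictionary<(int, int), double> EstimateClosestDistances(
            IList<LabelledPoint> hilbertOrdered, double dMax)
        {
            // Step 3: throw out points whose immediate Hilbert neighbors both come
            // from the same cluster; these are usually interior, not surface, points.
            var surface = new List<LabelledPoint>();
            for (int i = 0; i < hilbertOrdered.Count; i++)
            {
                bool prevSame = i > 0 &&
                    hilbertOrdered[i - 1].ClusterId == hilbertOrdered[i].ClusterId;
                bool nextSame = i < hilbertOrdered.Count - 1 &&
                    hilbertOrdered[i + 1].ClusterId == hilbertOrdered[i].ClusterId;
                if (!(prevSame && nextSame))
                    surface.Add(hilbertOrdered[i]);
            }

            var estimates = new Dictionary<(int, int), double>();
            for (int i = 0; i < surface.Count; i++)
            {
                var p1 = surface[i];
                var comparedClusters = new HashSet<int>();

                // Steps 4-6: scan forward, stopping at the next point of P1's own
                // cluster, and compare P1 to at most one point of each other cluster.
                for (int j = i + 1; j < surface.Count; j++)
                {
                    var p2 = surface[j];
                    if (p2.ClusterId == p1.ClusterId)
                        break;
                    if (!comparedClusters.Add(p2.ClusterId))
                        continue;

                    // Steps 5 and 7: record the distance if it beats the prior best
                    // for this cluster pair and is small enough to matter.
                    double dist = Distance(p1.Coordinates, p2.Coordinates);
                    if (dist >= dMax)
                        continue;
                    var key = p1.ClusterId < p2.ClusterId
                        ? (p1.ClusterId, p2.ClusterId)
                        : (p2.ClusterId, p1.ClusterId);
                    if (!estimates.TryGetValue(key, out double prior) || dist < prior)
                        estimates[key] = dist;
                }
            }
            return estimates;
        }

        private static double Distance(double[] a, double[] b)
        {
            double sum = 0;
            for (int d = 0; d < a.Length; d++) { double diff = a[d] - b[d]; sum += diff * diff; }
            return Math.Sqrt(sum);
        }
    }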

It is hard to conceive of how the Hilbert curve wiggles among the clusters, so my initial estimate was that the cost of this closest-pair search would be proportional to K^2. However, my testing shows that it is closer to K; it may be around K*log(K). Further research is necessary.

As for accuracy:

  • Comparing every point to every other point is 100% accurate.
  • The centroid method outlined in my question yields distances that are about 0.1% too high.
  • This method yields distances that are at worst 10% too high and on average 5% too high. However, the true closest cluster almost always turns up among the first to third closest clusters, so qualitatively it is good. The final clustering results using this method are excellent. My final clustering algorithm seems to be proportional to D*N*K or D*N*K*log(K) (D = dimensions, N = points, K = clusters).