8 votes

I am designing an agglomerative, bottom-up clustering algorithm for millions of 50-1000 dimensional points. In two parts of my algorithm, I need to compare two clusters of points and decide the separation between the two clusters. The exact distance is the minimum Euclidean distance taken over all pairs of points P1-P2 where P1 is taken from cluster C1 and P2 is taken from cluster C2. If C1 has X points and C2 has Y points then this requires X*Y distance measurements.
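
For reference, the exact computation is just a double loop over the two clusters. Here is a minimal C# sketch (representing points as double[] arrays is my own assumption):

    using System;
    using System.Collections.Generic;

    public static class ClusterDistance
    {
        // Exact minimum inter-cluster distance: X * Y distance measurements.
        public static double ExactMinimumDistance(IList<double[]> c1, IList<double[]> c2)
        {
            double best = double.MaxValue;
            foreach (var p1 in c1)
                foreach (var p2 in c2)
                    best = Math.Min(best, Distance(p1, p2));
            return best;
        }

        // Euclidean distance between two points of equal dimensionality.
        public static double Distance(double[] a, double[] b)
        {
            double sum = 0;
            for (int d = 0; d < a.Length; d++)
            {
                double diff = a[d] - b[d];
                sum += diff * diff;
            }
            return Math.Sqrt(sum);
        }
    }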

I currently estimate this distance in a way that requires X+Y measurements (a code sketch follows the list):

  1. Find the centroid Ctr1 of cluster C1.
  2. Find the point P2 in cluster C2 that is closest to Ctr1. (Y comparisons.)
  3. Find the point P1 in C1 that is closest to P2. (X comparisons.)
  4. The distance from P1 to P2 is an approximate measure of the distance between the clusters C1 and C2. It is an upper-bound on the true value.
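
In code, these four steps might look roughly like the following sketch, which continues the ClusterDistance class from the previous snippet:

    // Approximate minimum distance using X + Y distance measurements.
    // The result is an upper bound on the true minimum distance.
    public static double ApproximateMinimumDistance(IList<double[]> c1, IList<double[]> c2)
    {
        int dims = c1[0].Length;

        // 1. Find the centroid Ctr1 of cluster C1.
        var centroid = new double[dims];
        foreach (var p in c1)
            for (int d = 0; d < dims; d++)
                centroid[d] += p[d];
        for (int d = 0; d < dims; d++)
            centroid[d] /= c1.Count;

        // 2. Find the point P2 in C2 closest to Ctr1 (Y measurements).
        double[] p2 = null;
        double bestToCentroid = double.MaxValue;
        foreach (var q in c2)
        {
            double dist = Distance(centroid, q);
            if (dist < bestToCentroid) { bestToCentroid = dist; p2 = q; }
        }

        // 3. Find the point P1 in C1 closest to P2 (X measurements).
        double approx = double.MaxValue;
        foreach (var p in c1)
            approx = Math.Min(approx, Distance(p, p2));

        // 4. The P1-P2 distance approximates the cluster separation (upper bound).
        return approx;
    }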

If clusters are roughly spherical, this works very well, and my test data (ellipsoidal Gaussian clusters) fits that assumption. However, if clusters have weird, folded, bendy shapes, it may yield poor results. My questions are:

Is there an algorithm that uses even fewer than X+Y distance measurements and in the average case yields good accuracy?

OR

Is there an algorithm that (like mine) uses X+Y distance measurements but delivers better accuracy than mine?

(I am programming this in C#, but a description of an algorithm in pseudo-code or any other language is fine. Please avoid references to specialized library functions from R or Matlab. An algorithm with probabilistic guarantees like "95% chance that the distance is within 5% of the minimum value" is acceptable.)

NOTE: I just found this related question, which discusses a similar problem, though not necessarily for high dimensions: "Given two (large) sets of points, how can I efficiently find pairs that are nearest to each other?"

NOTE: I just discovered that this is called the bichromatic closest-pair problem.

For context, here is an overview of the overall clustering algorithm (a skeleton sketch follows the list):

  1. The first pass consolidates the densest regions into small clusters using a space-filling curve (Hilbert Curve). It misses outliers and often fails to merge adjacent clusters that are very close to one another. However, it does discover a characteristic maximum linkage-distance. All points separated by less than this characteristic distance must be clustered together. This step has no predefined number of clusters as its goal.

  2. The second pass performs single-linkage agglomeration by combining clusters together if their minimum distance is less than the maximum linkage-distance. This is not hierarchical clustering; it is partition-based. All clusters whose minimum distance from one another is less than this maximum linkage-distance will be combined. This step has no predefined number of clusters as its goal.

  3. The third pass performs additional single-linkage agglomeration, sorting all inter-cluster distances and only combining clusters until the number of clusters equals a predefined target number of clusters. It handles some outliers, preferring to only merge outliers with large clusters. If there are many outliers (and there usually are), this may fail to reduce the number of clusters to the target.

  4. The fourth pass combines all remaining outliers with the nearest large cluster, but causes no large clusters to merge with other large clusters. (This prevents two adjacent clusters from accidentally being merged due to their outliers forming a thin chain between them.)
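
As a skeleton, the four passes fit together roughly as shown below. Every name here is a placeholder of mine and the pass bodies are omitted; it only illustrates the structure:

    using System;
    using System.Collections.Generic;

    // Placeholder skeleton of the four passes; only the structure is shown.
    public sealed class FourPassClusterer
    {
        public List<List<double[]>> Cluster(IList<double[]> points, int targetClusterCount)
        {
            // Pass 1: Hilbert-curve pass forms small, dense clusters and discovers
            // the characteristic maximum linkage-distance.
            var clusters = HilbertPass(points, out double maxLinkageDistance);

            // Pass 2: single-linkage agglomeration of all clusters whose minimum
            // distance is below the maximum linkage-distance (no target count).
            clusters = MergeWithinLinkageDistance(clusters, maxLinkageDistance);

            // Pass 3: sort inter-cluster distances and merge until the target
            // number of clusters is reached, preferring to merge outliers into
            // large clusters.
            clusters = MergeTowardTarget(clusters, targetClusterCount);

            // Pass 4: absorb remaining outliers into the nearest large cluster
            // without allowing two large clusters to merge through outlier chains.
            return AbsorbOutliers(clusters);
        }

        private List<List<double[]>> HilbertPass(IList<double[]> points, out double maxLinkageDistance)
            => throw new NotImplementedException();

        private List<List<double[]>> MergeWithinLinkageDistance(List<List<double[]>> clusters, double maxLinkageDistance)
            => throw new NotImplementedException();

        private List<List<double[]>> MergeTowardTarget(List<List<double[]>> clusters, int targetClusterCount)
            => throw new NotImplementedException();

        private List<List<double[]>> AbsorbOutliers(List<List<double[]>> clusters)
            => throw new NotImplementedException();
    }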

Did you try something like that? – Borbag
No! It helps to know the "name" of the problem! Thanks! I will read the article. – Paul Chernoch
Interesting algorithms in the article. However, the recursive divide-and-conquer algorithm's dependence on D (number of dimensions) is a problem, because for me, D is often greater than K (number of clusters). I will study it further. – Paul Chernoch
Could you try to reduce dimensionality, or are all the dimensions approximately equally important? – kfx
@kfx - In my real-world data, I identified redundant dimensions and reduced it from 40,000 to 950 dimensions. In subsequent processing, I weight zip codes by population, but in the clustering, all dimensions that are not redundant are considered equally important. – Paul Chernoch

2 Answers

0 votes

You could use an index. That is the classic solution.

A spatial index can help you find the nearest neighbor of any point in roughly O(log n) time. So if your clusters have n and m objects, index the larger cluster and query it with every point of the smaller one, to find the closest pair in O(n log m) or O(m log n).

A simpler heuristic approach is to iterate your idea multiple times, shrinking your set of candidates. So you find a good pair of objects a, b from the two clusters. Then you discard, from each cluster, every object that the triangle inequality proves cannot be part of a pair closer than that upper bound. Then you repeat this, choosing a different a, b each time. Once your candidate sets stop improving, do the pairwise comparisons on the remaining objects only. The worst case of this approach remains O(n*m).
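
For concreteness, one round of that pruning could look like the sketch below (the double[] point representation, the choice of anchor, and the HasValueWithin helper are my own assumptions, not a prescribed implementation):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class AnchorPruning
    {
        // One pruning round: a is a point of C1, b a point of C2, and d(a, b) is
        // the current upper bound on the minimum inter-cluster distance.
        public static double PrunedMinimumDistance(IList<double[]> c1, IList<double[]> c2,
                                                   double[] a, double[] b)
        {
            double upperBound = Distance(a, b);

            // Distance of every point to the anchor b: n + m measurements.
            double[] r1 = c1.Select(p => Distance(p, b)).ToArray();
            double[] r2 = c2.Select(q => Distance(q, b)).ToArray();

            // Reverse triangle inequality: d(p, q) >= |d(p, b) - d(q, b)|.
            // A point survives only if some point of the other cluster has an
            // anchor distance within +/- upperBound of its own.
            double[] sorted1 = (double[])r1.Clone(); Array.Sort(sorted1);
            double[] sorted2 = (double[])r2.Clone(); Array.Sort(sorted2);
            var survivors1 = c1.Where((p, i) => HasValueWithin(sorted2, r1[i], upperBound)).ToList();
            var survivors2 = c2.Where((q, j) => HasValueWithin(sorted1, r2[j], upperBound)).ToList();

            // Pairwise comparison over the survivors only (or repeat the pruning
            // with a different anchor pair if the candidate sets are still large).
            double best = upperBound;
            foreach (var p in survivors1)
                foreach (var q in survivors2)
                    best = Math.Min(best, Distance(p, q));
            return best;
        }

        // True if some value in the sorted array lies within [center - tol, center + tol].
        private static bool HasValueWithin(double[] sortedValues, double center, double tol)
        {
            int i = Array.BinarySearch(sortedValues, center - tol);
            if (i < 0) i = ~i;                      // index of first value >= center - tol
            return i < sortedValues.Length && sortedValues[i] <= center + tol;
        }

        private static double Distance(double[] a, double[] b)
        {
            double sum = 0;
            for (int d = 0; d < a.Length; d++) { double diff = a[d] - b[d]; sum += diff * diff; }
            return Math.Sqrt(sum);
        }
    }

This pruning is lossless: the reverse triangle inequality around the anchor b can never discard the true closest pair, so the comparison over the survivors returns the exact minimum; the savings depend on how many points the anchor test removes.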

0 votes

I have found a paper that describes a linear-time, randomized, epsilon-approximate algorithm for the bichromatic closest-pair problem:

http://www.cs.umd.edu/~samir/grant/cp.pdf

I will attempt to implement it and see if it works.

UPDATE - After further study, it is apparent that the runtime is proportional to 3^D, where D is the number of dimensions. This is unacceptable for my data. After trying several other approaches, I hit on the following (a code sketch of steps 3 through 7 appears after the list).

  1. Perform a rough clustering into K clusters using an efficient but incomplete method. This method will properly cluster some points, but yield too many clusters. These small clusters remain to be further consolidated to form larger clusters. This method will determine an upper bound distance DMAX between points that are considered to be in the same cluster.
  2. Sort the points in Hilbert curve order.
  3. Throw out all points immediately preceded and succeeded by a neighbor from the same cluster. More often than not, these are interior points of a cluster, not surface points.
  4. For each point P1, search forward, but no farther than the next point from the same cluster.
  5. Compute the distance from point P1 (in cluster C1) to each visited point P2 (in cluster C2), and record the distance if it is smaller than any prior distance measured between points in C1 and C2.
  6. However, if P1 has already been compared to a point in C2, do not do so again. Only make a single comparison between P1 and any point in C2.
  7. After all comparisons have been made, at most K(K-1) distances will have been recorded, and many of those are discarded because they are larger than DMAX. The remainder are the estimated closest-point distances.
  8. Perform merges between clusters if they are nearer than DMAX.
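
Here is a C# sketch of steps 3 through 7, assuming the points are already sorted in Hilbert-curve order (step 2) and carry the preliminary cluster ids from step 1; the LabelledPoint type and the tuple-keyed dictionary are my own choices:

    using System;
    using System.Collections.Generic;

    public static class HilbertClosestPairs
    {
        public sealed class LabelledPoint
        {
            public double[] Coordinates;
            public int ClusterId;
        }

        // Returns, for each pair of cluster ids, an estimated closest-point distance,
        // keeping only estimates below dMax (the merge candidates of step 8).
        public static Dictionary<(int, int), double> EstimateClosestDistances(
            IList<LabelledPoint> hilbertOrdered, double dMax)
        {
            // Step 3: throw out points whose immediate Hilbert neighbors both come
            // from the same cluster; these are usually interior, not surface, points.
            var surface = new List<LabelledPoint>();
            for (int i = 0; i < hilbertOrdered.Count; i++)
            {
                bool prevSame = i > 0 &&
                    hilbertOrdered[i - 1].ClusterId == hilbertOrdered[i].ClusterId;
                bool nextSame = i < hilbertOrdered.Count - 1 &&
                    hilbertOrdered[i + 1].ClusterId == hilbertOrdered[i].ClusterId;
                if (!(prevSame && nextSame))
                    surface.Add(hilbertOrdered[i]);
            }

            var estimates = new Dictionary<(int, int), double>();
            for (int i = 0; i < surface.Count; i++)
            {
                var p1 = surface[i];
                var comparedClusters = new HashSet<int>();

                // Steps 4-6: scan forward, stopping at the next point of P1's own
                // cluster, and compare P1 to at most one point of each other cluster.
                for (int j = i + 1; j < surface.Count; j++)
                {
                    var p2 = surface[j];
                    if (p2.ClusterId == p1.ClusterId)
                        break;
                    if (!comparedClusters.Add(p2.ClusterId))
                        continue;

                    // Steps 5 and 7: record the distance if it beats the prior best
                    // for this cluster pair and is small enough to matter.
                    double dist = Distance(p1.Coordinates, p2.Coordinates);
                    if (dist >= dMax)
                        continue;
                    var key = p1.ClusterId < p2.ClusterId
                        ? (p1.ClusterId, p2.ClusterId)
                        : (p2.ClusterId, p1.ClusterId);
                    if (!estimates.TryGetValue(key, out double prior) || dist < prior)
                        estimates[key] = dist;
                }
            }
            return estimates;
        }

        private static double Distance(double[] a, double[] b)
        {
            double sum = 0;
            for (int d = 0; d < a.Length; d++) { double diff = a[d] - b[d]; sum += diff * diff; }
            return Math.Sqrt(sum);
        }
    }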

It is hard to conceive of how the Hilbert curve wiggles among the clusters, so my initial estimate was that the cost of this closest-pair search would be proportional to K^2. However, my testing shows that it is closer to K; it may be around K*log(K). Further research is necessary.

As for accuracy:

  • Comparing every point to every other point is 100% accurate.
  • The centroid method outlined in my question yields distances that are about 0.1% too high.
  • This method yields distances that are at worst 10% too high and on average 5% too high. However, the true closest cluster almost always turns up among the first to third closest clusters, so qualitatively it is good. The final clustering results using this method are excellent. My final clustering algorithm seems to be proportional to D*N*K or D*N*K*log(K) (D = dimensions, N = points, K = clusters).