4
votes

I want to explore relations between data items in a large array. Each data item is represented by a multidimensional vector. As a first step, I've decided to use clustering. I'm interested in finding hierarchical relations between clusters (groups of data vectors). I can calculate the distance between my vectors, so first I build a minimal spanning tree. After that I need to group data vectors according to the links in my spanning tree. But at this step I'm stuck: how do I combine different vectors into hierarchical clusters? I'm using a heuristic: if two vectors are linked and the distance between them is very small, they belong to the same cluster; if two vectors are linked but the distance between them is larger than a threshold, they belong to different clusters with a common root cluster.
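For concreteness, here is a minimal sketch of that MST-plus-threshold idea (the toy data, the threshold value, and the SciPy calls are illustrative assumptions, not your actual setup): build the minimal spanning tree over the pairwise distances, drop every edge longer than the threshold, and take the resulting connected components as clusters.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

# Toy data: two well-separated groups of 2-D vectors.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.1, (5, 2)),
                  rng.normal(5.0, 0.1, (5, 2))])

dist = squareform(pdist(data))          # full pairwise distance matrix
mst = minimum_spanning_tree(dist)       # sparse matrix of MST edges

# Cut every MST edge longer than the threshold; the surviving
# connected components are the flat clusters.
threshold = 1.0
mst_cut = mst.copy()
mst_cut.data[mst_cut.data > threshold] = 0
mst_cut.eliminate_zeros()
n_clusters, labels = connected_components(mst_cut, directed=False)
print(n_clusters, labels)   # 2 clusters: points 0-4 vs. points 5-9
```

The only long MST edge is the one bridging the two groups, so a single cut recovers them.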

But maybe there is a better solution?

Thanks

P.S. Thanks to all!

In fact, I've tried k-means and a variation of CLOPE, but didn't get good results.

So now I know that the clusters in my dataset actually have a complex structure (much more complex than n-spheres).

That's why I want to use hierarchical clustering. I also guess that my clusters look like n-dimensional chains (like 3D or 2D chains), so I use the single-link strategy. But I'm stuck on how to combine clusters with each other: in which situations should I create a common root cluster, and in which situations should I merge all sub-clusters into one cluster? I'm using this simple strategy:

  • If clusters (or vectors) are very close to each other, I combine their contents into one cluster (regulated by a threshold)
  • If clusters (or vectors) are far from each other, I create a root cluster and put them into it

But with this strategy I get very large cluster trees, and I'm struggling to find a satisfactory threshold. Is there a better strategy to generate the cluster tree?
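To make the two rules concrete, here is a minimal sketch (the data, the threshold, and the `build_tree` helper are made up for illustration): walk a single-linkage merge sequence bottom-up; merges below the threshold flatten their members into one cluster, while merges above it create a common root over the sub-clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def build_tree(Z, n_leaves, threshold):
    """Turn a SciPy linkage matrix into a nested tree:
    merges at distance <= threshold combine contents into one flat
    cluster; merges above it create a common root over sub-clusters."""
    nodes = {i: [i] for i in range(n_leaves)}   # leaf "clusters"
    flat = {i: True for i in range(n_leaves)}   # still a single cluster?
    for k, (a, b, dist, _) in enumerate(Z):
        a, b = int(a), int(b)
        nid = n_leaves + k
        if dist <= threshold and flat[a] and flat[b]:
            nodes[nid] = nodes[a] + nodes[b]    # combine contents
            flat[nid] = True
        else:
            nodes[nid] = [nodes[a], nodes[b]]   # common root cluster
            flat[nid] = False
        del nodes[a], nodes[b]
    return nodes.popitem()[1]                   # the single remaining root

data = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
Z = linkage(pdist(data), method='single')
tree = build_tree(Z, len(data), threshold=0.5)
print(tree)   # two flat sub-clusters under one root: {0,1,2} and {3,4}
```

One observation about why trees blow up with a single threshold: every above-threshold merge adds another root level, so long chains of borderline distances produce deep trees. A common alternative is to look at the gap between successive merge distances rather than an absolute cutoff.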

Here is a simple picture that illustrates my question:



2 Answers

4
votes

A lot of work has been done in this area. The usual advice is to start with K-means clustering unless you have a really good reason to do otherwise. K-means does not normally do hierarchical clustering, so you may indeed have such a reason. That said, it's entirely possible to do hierarchical K-means: do a first pass to create clusters, then do another pass using the centroid of each of those clusters as a point, and continue until you have as few top-level clusters as desired.
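A minimal sketch of that two-pass idea, using a plain Lloyd's-algorithm K-means (the data sizes, `k` values, and the `kmeans` helper are illustrative assumptions, not a definitive implementation):

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        d = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

# Four loose groups of 2-D points.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(c, 0.2, (20, 2)) for c in (0.0, 3.0, 6.0, 9.0)])

# Level 1: many fine clusters.  Level 2: cluster the centroids themselves.
fine_centroids, fine_labels = kmeans(data, k=8)
coarse_centroids, coarse_of_fine = kmeans(fine_centroids, k=4)

# Each point's top-level cluster is its fine cluster's coarse label.
top_labels = coarse_of_fine[fine_labels]
```

Repeating the second step on the coarse centroids gives further levels, yielding a cluster tree of whatever depth you need.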

There are quite a few other clustering models though, and quite a few papers covering relative strengths and weaknesses, such as the following:

  1. Pairwise Clustering and Graphical Models
  2. Beyond pairwise clustering
  3. Parallel pairwise clustering
  4. A fast greedy pairwise distance clustering algorithm and its use in discovering thematic structures in large data sets
  5. Pairwise Clustering Algorithm
  6. Hierarchical Agglomerative Clustering

A little Googling will turn up lots more. Glancing back through my research directory from when I was working on clustering, I have dozens of papers, and my recollection is that there were a lot more that I looked at but didn't keep around, and many more still that I never got a chance to really even look at.

2
votes

There is a whole zoo of clustering algorithms. Among them, minimum-spanning-tree clustering, a.k.a. single-linkage clustering, has some nice theoretical properties, as noted e.g. at http://www.cs.uwaterloo.ca/~mackerma/Taxonomy.pdf. In particular, if you take a minimum spanning tree and remove all links longer than some threshold length, the resulting grouping of points into clusters has the minimum total length of remaining links among all groupings with that number of clusters, for the same reason that Kruskal's algorithm produces a minimum spanning tree.
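That equivalence means you don't need to cut the MST by hand: single-linkage merge distances are exactly the MST edge lengths, so cutting the dendrogram at a distance `t` deletes precisely the MST edges longer than `t`. A minimal sketch with SciPy (the toy data and threshold are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

data = np.array([[0.0, 0.0], [0.2, 0.0], [0.1, 0.1],   # tight group
                 [4.0, 4.0], [4.2, 4.0],               # second group
                 [9.0, 0.0]])                          # lone outlier

# Single-linkage dendrogram; cutting it at t = 1.0 is equivalent to
# removing every MST edge longer than 1.0.
Z = linkage(pdist(data), method='single')
labels = fcluster(Z, t=1.0, criterion='distance')
print(labels)   # three clusters: {0,1,2}, {3,4}, and {5} alone
```

The linkage matrix `Z` itself is the full hierarchy, so the same structure supports both flat cuts and tree inspection.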

However, there is no guarantee that a minimum spanning tree will be the best for your particular purpose, so I think you should either write down what you actually need from your clustering algorithm and then choose a method based on that, or try a variety of different clustering algorithms on your data and see which is best in practice.