24 votes

Can anyone point me to a hierarchical clustering tool (preferably in Python) that can cluster ~1 million objects? I have tried hcluster and also Orange.

hcluster had trouble with 18k objects. Orange was able to cluster 18k objects in seconds, but failed with 100k objects (saturated memory and eventually crashed).

I am running on a 64-bit Xeon CPU (2.53 GHz) with 8 GB of RAM + 3 GB swap on Ubuntu 11.10.

are your points in 2d, 3d, 10d, 128d? – denis
@Denis I don't understand what you mean by that. As such, the limitation seems to stem from the fact that an n×n distance matrix for 1M objects can't fit into memory, and each of the clustering libraries I mentioned above (Orange and scipy) takes an in-memory distance matrix as input (which is not possible to provide for 1M objects...). – Atish Kathpal
The points/objects are simple text files that I am trying to cluster based on the text they contain... can you also explain whether this is 2d or what? Thanks. – Atish Kathpal
@user940154 The idea of 2d or nd refers to the dimensionality of the feature space you operate in. So, if you attach 'length' and 'starting letter' features to each word, that would be a 2d feature space. High-dimensional, sparse data requires special data structures for efficient clustering. – om-nom-nom

2 Answers

15 votes

The problem probably is that they try to compute the full 2D distance matrix, which for 1 million objects is about 8 TB naively with double precision (10^6 x 10^6 entries at 8 bytes each), and then their algorithm runs in O(n^3) time anyway.

You should seriously consider using a different clustering algorithm. Hierarchical clustering is slow, and the results are usually not at all convincing. In particular for millions of objects, you can't just look at the dendrogram to choose the appropriate cut.

If you really want to continue with hierarchical clustering, I believe that ELKI (Java, though) has an O(n^2) implementation of SLINK, which at 1 million objects should be approximately 1 million times as fast as the naive O(n^3) approach. I don't know if they have CLINK yet, too, and I'm not sure there actually is any sub-O(n^3) algorithm for variants other than single-link and complete-link.
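ELKI itself is Java, but to illustrate why SLINK manages O(n^2) time with only O(n) memory, here is a minimal NumPy sketch of the pointer-representation algorithm (Sibson, 1973). This is my own illustration, not ELKI's code; it assumes dense Euclidean vectors, and the pure-Python inner loop would need to be ported to C/Cython to be practical at 1M points:

```python
import numpy as np

def slink(X):
    """SLINK single-linkage clustering (Sibson, 1973).

    O(n^2) time, O(n) memory: no distance matrix is ever stored.
    Returns the pointer representation (pi, lam): point j merges
    into cluster pi[j] at dendrogram height lam[j].
    """
    n = len(X)
    pi = np.zeros(n, dtype=np.intp)  # merge target per point
    lam = np.full(n, np.inf)         # merge height per point
    m = np.empty(n)                  # scratch: distances to point i
    for i in range(1, n):
        pi[i] = i
        lam[i] = np.inf
        # Only one row of distances is needed at a time.
        m[:i] = np.sqrt(((X[:i] - X[i]) ** 2).sum(axis=1))
        for j in range(i):
            if lam[j] >= m[j]:
                m[pi[j]] = min(m[pi[j]], lam[j])
                lam[j] = m[j]
                pi[j] = i
            else:
                m[pi[j]] = min(m[pi[j]], m[j])
        for j in range(i):
            if lam[j] >= lam[pi[j]]:
                pi[j] = i
    return pi, lam
```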

Consider using other algorithms. k-means, for example, scales very well with the number of objects (it's just usually not very good either, unless your data is very clean and regular). DBSCAN and OPTICS are quite good in my opinion, once you have a feel for the parameters. If your data set is low-dimensional, they can be accelerated quite well with an appropriate index structure: they then run in O(n log n) if you have an index with O(log n) query time, which can make a huge difference for large data sets. I've personally used OPTICS on a 110k-image data set without problems, so I can imagine it scales up well to 1 million on your system.
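As a rough sketch of the DBSCAN route with scikit-learn (the data and parameter values below are placeholders; eps and min_samples have to be tuned for your data, and the tree index helps mostly for low-dimensional dense vectors):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Stand-in data: 100k points in 3d.
X = np.random.rand(100_000, 3)

# A tree-based index keeps each neighborhood query near O(log n),
# so the run avoids building the O(n^2) distance matrix.
db = DBSCAN(eps=0.05, min_samples=10, algorithm='ball_tree').fit(X)
labels = db.labels_  # label -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```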

11 votes

To beat O(n^2), you'll have to first reduce your 1M points (documents) to e.g. 1000 piles of 1000 points each, or 100 piles of 10k each, or ...
Two possible approaches:

  • build a hierarchical tree from, say, 15k points, then add the rest one by one: time ~ 1M * tree depth

  • first build 100 or 1000 flat clusters, then build your hierarchical tree of the 100 or 1000 cluster centres (a rough sketch of this follows below).
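
A minimal sketch of the second approach, assuming scikit-learn and SciPy are acceptable (the feature matrix and cluster counts here are placeholders):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import MiniBatchKMeans

# Stand-in for your 1M document vectors (e.g. TF-IDF in practice).
X = np.random.rand(1_000_000, 20)

# Stage 1: flat-cluster down to 1000 centres. MiniBatchKMeans streams
# the data in small batches, so it never builds a distance matrix.
km = MiniBatchKMeans(n_clusters=1000, batch_size=10_000).fit(X)

# Stage 2: hierarchical clustering on just the 1000 centres --
# a 1000 x 1000 problem is easy for any hierarchical code.
Z = linkage(km.cluster_centers_, method='average')

# Each original point hangs off the leaf of its centre:
point_to_leaf = km.labels_
```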

How well either of these might work depends critically on the size and shape of your target tree -- how many levels, how many leaves?
What software are you using, and how many hours / days do you have to do the clustering?

For the flat-cluster approach, k-d trees work fine for points in 2d, 3d, 20d, even 128d -- but not in your case. I know hardly anything about clustering text; locality-sensitive hashing, perhaps?
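For completeness, a tiny sketch of one LSH flavour, random hyperplanes for cosine similarity (Charikar, 2002), since that is the variant usually suggested for text vectors. Everything here is illustrative, not a library API:

```python
import numpy as np

def lsh_buckets(X, n_bits=16, seed=0):
    """Hash rows of a dense (n, d) array into 2^n_bits buckets so that
    rows with high cosine similarity tend to collide; candidate pairs
    then only need to be compared within the same bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_bits))
    bits = (X @ planes >= 0).astype(np.int64)  # sign pattern per row
    keys = bits @ (1 << np.arange(n_bits, dtype=np.int64))  # pack bits
    buckets = {}
    for i, key in enumerate(keys):
        buckets.setdefault(int(key), []).append(i)
    return buckets
```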

Take a look at scikit-learn clustering -- it has several methods, including DBSCAN.
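scikit-learn's agglomerative clustering can also be constrained with a sparse k-nearest-neighbour graph, which avoids the full n x n distance matrix entirely; a hedged sketch (data and parameters are placeholders):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

X = np.random.rand(50_000, 10)  # stand-in for your feature vectors

# Merges are only considered along the sparse neighbour graph,
# so memory stays roughly O(n * n_neighbors) instead of O(n^2).
conn = kneighbors_graph(X, n_neighbors=10, include_self=False)
model = AgglomerativeClustering(n_clusters=100, connectivity=conn,
                                linkage='ward').fit(X)
labels = model.labels_
```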

Added: see also
google-all-pairs-similarity-search, "Algorithms for finding all similar pairs of vectors in sparse vector data", Bayardo et al., 2007
SO hierarchical-clusterization-heuristics