Clustering words based on Distance Matrix

Question

My objective is to cluster words based on how similar they are with respect to a corpus of text documents. I have computed Jaccard Similarity between every pair of words. In other words, I have a sparse distance matrix available with me. Can anyone point me to any clustering algorithm (and possibly its library in Python) which takes distance matrix as input ? I also do not know the number of clusters beforehand. I only want to cluster these words and obtain which words are clustered together.

take a look at code.google.com/p/em-python and "en.wikipedia.org/wiki/Expectation–maximization_algorithm" — Moj
@Moj I'm sorry...I can't seem to figure out how the information contained in the links you have mentioned are relevant here — user2115183
(EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the. I guess this fit you goal as also don't know the number of clusters before hand. those are two libraries(or implementation ) of this algorithm . — Moj
@Moj I was hoping something along the lines of k-means or hierarchical clustering...i know these require the number of clusters to be known beforehand.....but i hope there are ways to figure out the optimum number of clusters — user2115183

Unknown Unknown · Accepted Answer · 2013-04-27T12:59:29

You can use most algorithms in scikit-learn with a precomputed distance matrix. Unfortunately you need the number of clusters for many algorithm. DBSCAN is the only one that doesn't need the number of clusters and also uses arbitrary distance matrices. You could also try MeanShift, but that will interpret the distances as coordinates - which might also work.

There is also affinity propagation, but I haven't really seen that working well. If you want many clusters, that might be helpful, though.

disclosure: I'm a scikit-learn core dev.

Clustering words based on Distance Matrix

3 Answers