24
votes

My objective is to cluster words based on how similar they are with respect to a corpus of text documents. I have computed Jaccard Similarity between every pair of words. In other words, I have a sparse distance matrix available with me. Can anyone point me to any clustering algorithm (and possibly its library in Python) which takes distance matrix as input ? I also do not know the number of clusters beforehand. I only want to cluster these words and obtain which words are clustered together.

3
take a look at code.google.com/p/em-python and "en.wikipedia.org/wiki/Expectation–maximization_algorithm"Moj
@Moj I'm sorry...I can't seem to figure out how the information contained in the links you have mentioned are relevant hereuser2115183
(EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the. I guess this fit you goal as also don't know the number of clusters before hand. those are two libraries(or implementation ) of this algorithm .Moj
@Moj I was hoping something along the lines of k-means or hierarchical clustering...i know these require the number of clusters to be known beforehand.....but i hope there are ways to figure out the optimum number of clustersuser2115183

3 Answers

15
votes

You can use most algorithms in scikit-learn with a precomputed distance matrix. Unfortunately you need the number of clusters for many algorithm. DBSCAN is the only one that doesn't need the number of clusters and also uses arbitrary distance matrices. You could also try MeanShift, but that will interpret the distances as coordinates - which might also work.

There is also affinity propagation, but I haven't really seen that working well. If you want many clusters, that might be helpful, though.

disclosure: I'm a scikit-learn core dev.

9
votes

The scipy clustering package could be usefull (scipy.cluster). There are hierarchical clustering functions in scipy.cluster.hierarchy. Note however that those require a condensed matrix as input (the upper triangular of the distance matrix). Hopefully the documentation pages will help you along.

0
votes

Recommend to take a look at agglomerative clustering.