
I need to cluster 500K+ strings based on their similarity.

I have calculated their pairwise Levenshtein distances and built a sparse similarity matrix from them. The matrix contains binary similarities: values for small distances are set to 1.0 and all others to 0.0.
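In outline, the construction looks like this (a sketch only: it assumes python-Levenshtein's distance function, and the threshold of 2 edits is illustrative, not the value I actually used; the double loop over all pairs is the expensive O(n^2) part):

from scipy.sparse import lil_matrix
from Levenshtein import distance  # python-Levenshtein

def binary_similarity(strings, threshold=2):
    """Binary sparse similarity: 1.0 wherever the edit distance is small."""
    n = len(strings)
    sim = lil_matrix((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # threshold is an illustrative cutoff for "small distance"
            if distance(strings[i], strings[j]) <= threshold:
                sim[i, j] = sim[j, i] = 1.0
    return sim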

I don't know what kind of clustering is suitable for this. I don't know the number of clusters in advance either, but it may be considerably large, because the similarity matrix is very sparse (about 0.1% of the values are non-zero).

Just to be sure: you have calculated pairwise LDs for 500K strings, which means you have a matrix of size 500K x 500K? – Tấn Nguyên

Kind of. The LDs are calculated but not stored. I only stored the sparse binary similarity matrix using scipy.sparse.lil_matrix. – landings

1 Answer


Have you considered something like Soundex (https://en.wikipedia.org/wiki/Soundex)? The advantage of such phonetic algorithms is that similar words map to the same canonical form. For example, both "Robert" and "Rupert" yield the same string, "R163". Your clustering then boils down to a map like:

clusters = { canonical_form: [list of similar words] }

Naturally, you can tweak the Soundex rules according to your domain.
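For illustration, here is a minimal sketch of the whole approach. The soundex function below is a simplified pure-Python rendering of American Soundex (a real project might prefer an existing implementation, e.g. jellyfish.soundex), and the test names are just examples:

from collections import defaultdict

def soundex(word):
    """Simplified American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        if ch not in "hw":  # 'h' and 'w' do not separate identical codes
            prev = code
    return result.ljust(4, "0")

def cluster_by_soundex(words):
    """Group words by their canonical Soundex form, as described above."""
    clusters = defaultdict(list)
    for word in words:
        clusters[soundex(word)].append(word)
    return clusters

print(dict(cluster_by_soundex(["Robert", "Rupert", "Ashcraft", "Ashcroft"])))
# {'R163': ['Robert', 'Rupert'], 'A261': ['Ashcraft', 'Ashcroft']}

Note that this is a single pass over the 500K strings rather than a pairwise comparison, which is what makes the canonical-form idea attractive at this scale.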