I need to cluster 500K+ strings based on their similarity.
I have calculated their pair-wise Levenshtein Distances and made a sparse similarity matrix. This matrix contains binary similarities: values for small distances are set to 1.0 and others are 0.0.
I don't know what kind of clustering is good for me. I don't know the number of clusters in advance but it may be considerably large because the similarity matrix is very sparse (about 0.1% values are non-zero).
scipy.sparse.lil_matrix
. – landings