I have a large set of diagnosis code sequences that I am trying to cluster by similarity. I built a distance matrix by computing the similarity of each pair of sequences with the longest common subsequence (LCS) algorithm, then subtracting that similarity from 1 to get the distance between the two sequences.
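To make the distance definition concrete, here is a minimal sketch of how such a matrix could be built. The normalization is an assumption (LCS length divided by the length of the longer sequence), since the exact similarity formula is not stated above, and the helper names `lcs_length` and `lcs_distance_matrix` are purely illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best subsequence that ends before both items.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of skipping x or skipping y.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance_matrix(sequences):
    """Symmetric matrix of 1 - similarity, with zeros on the diagonal."""
    n = len(sequences)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            longest = max(len(sequences[i]), len(sequences[j]))
            # Assumed normalization: LCS length over the longer sequence's length.
            if longest:
                sim = lcs_length(sequences[i], sequences[j]) / longest
            else:
                sim = 1.0
            dist[i][j] = dist[j][i] = 1.0 - sim
    return dist
```

A different normalization (for example, dividing by the shorter length) changes which pairs fall within `eps`, so the choice affects the clustering directly.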
I then passed that distance matrix to sklearn's DBSCAN like so:
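from sklearn.cluster import DBSCAN
# sim_mat holds the precomputed pairwise distances (1 - LCS similarity)
db = DBSCAN(eps=0.34, metric='precomputed')
db.fit(sim_mat)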
After creating the clusters, I wrote the sequences contained in each one to a text file. All of the clusters make sense to me except for one. For example, this cluster makes sense, as every sequence shares the same two codes in the same order:
['345.3', '345.11']
['345.3', '345.11', '038.9', '038.0', '276.51']
['345.3', '345.11']
['322.9', '345.3', '345.11']
This cluster, however (shortened here because it contains 2852 sequences), does not make sense to me, as none of the sequences appear to have any codes in common:
['162.3', '038.9']
['578.1', '584.9']
['416.8', '486', '486', '038.11']
['493.92', '428.0', '584.9', '427.89']
['414.01', '998.59']
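For reference, the per-cluster output described above could be produced along these lines. This is only a sketch: `group_by_label` is a hypothetical helper, and the labels shown are made up for illustration; with a fitted model they would come from `db.labels_`. One documented detail of scikit-learn's DBSCAN is that it assigns the label -1 to samples it treats as noise, so it may be worth checking whether a large, incoherent group is simply the -1 group rather than a real cluster.

```python
from collections import defaultdict


def group_by_label(sequences, labels):
    """Map each cluster label to the list of sequences assigned to it."""
    groups = defaultdict(list)
    for seq, label in zip(sequences, labels):
        groups[int(label)].append(seq)
    return dict(groups)


# Illustrative data only; in practice `labels` would be db.labels_.
sequences = [
    ['345.3', '345.11'],
    ['322.9', '345.3', '345.11'],
    ['162.3', '038.9'],
    ['578.1', '584.9'],
]
labels = [0, 0, -1, -1]

groups = group_by_label(sequences, labels)
# Label -1 marks noise in scikit-learn's DBSCAN, so keep it apart from clusters.
noise = groups.pop(-1, [])
```

Separating the -1 group before writing the text file makes it easier to tell genuine clusters from unclustered sequences.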
My question is whether this is a bug in DBSCAN or whether I am misunderstanding how to use it and/or how it should work. Furthermore, whether this is a bug or the expected output of the algorithm, is there another clustering algorithm that I should look into using?