
I have a large set of diagnosis code sequences that I am trying to cluster based on similarity. I created a distance matrix by computing the similarity between each pair of sequences using the longest common subsequence (LCS) algorithm, then subtracting that similarity from 1 to get the distance.
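The distance-matrix construction described above can be sketched roughly as follows. This is an assumption about the details: the post does not say how the LCS similarity was normalized to [0, 1], so dividing the LCS length by the longer sequence's length is illustrative, not the author's exact method.

```python
import numpy as np

def lcs_length(a, b):
    # classic dynamic-programming longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def lcs_distance(a, b):
    # assumed normalization: scale similarity to [0, 1] by the longer length
    sim = lcs_length(a, b) / max(len(a), len(b))
    return 1.0 - sim

# toy sequences modelled on the codes in the question
seqs = [['345.3', '345.11'],
        ['345.3', '345.11', '038.9'],
        ['162.3', '038.9']]

n = len(seqs)
dist_mat = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist_mat[i, j] = dist_mat[j, i] = lcs_distance(seqs[i], seqs[j])
```

The resulting symmetric matrix (zero diagonal) is what DBSCAN's `metric='precomputed'` expects.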

I then passed that distance matrix into sklearn's DBSCAN like so:

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.34, metric='precomputed')
db.fit(sim_mat)  # sim_mat holds the 1 - similarity distances

After creating the clusters, I output the sequences contained in each one to a text file. Each of the clusters makes sense to me except for one. For example, this cluster makes sense to me, as each sequence has two of the codes in common and in the same order:

['345.3', '345.11']
['345.3', '345.11', '038.9', '038.0', '276.51']
['345.3', '345.11']
['322.9', '345.3', '345.11']

This cluster, however, (shortened here because it contains 2852 sequences) does not make sense to me, as none of the sequences have any codes in common:

['162.3', '038.9']
['578.1', '584.9']
['416.8', '486', '486', '038.11']
['493.92', '428.0', '584.9', '427.89']
['414.01', '998.59']

My question is whether this is a bug in DBSCAN or whether I am misunderstanding how to use it and/or how it should work. Furthermore, if this is the expected output of the algorithm, is there another one that I should look into using?


2 Answers


By design (the N in DBSCAN stands for Noise), the algorithm also recognizes objects that do not belong to any cluster, referred to as noise.

If you incorrectly treat the noise points as a single cluster, they will of course appear entirely unrelated.

Some samples simply don't fit any cluster, so this is a feature, not a limitation. You could assign each point to the same cluster as its nearest clustered point, but that would not improve the cluster quality.
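In scikit-learn, noise points are given the label -1 in `labels_`, so a 2852-element "cluster" of unrelated sequences is almost certainly the noise set. A minimal sketch with a toy precomputed distance matrix (the values are invented for illustration: two tight pairs plus one outlier):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# invented distance matrix: points 0-1 and 2-3 are close, point 4 is isolated
dist = np.array([
    [0.0, 0.1, 0.9, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1, 0.9],
    [0.9, 0.9, 0.1, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.0],
])

db = DBSCAN(eps=0.34, min_samples=2, metric='precomputed').fit(dist)
print(db.labels_.tolist())  # [0, 0, 1, 1, -1] -- the isolated point is noise

# keep only the points that were actually assigned to a cluster
clustered = db.labels_ != -1
```

Filtering on `labels_ != -1` before writing the clusters to a file would keep the noise points out of the output.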


I figured it out. Based on the description of DBSCAN (https://en.wikipedia.org/wiki/DBSCAN), this behavior is normal. The algorithm starts with one point, finds its neighbors within the eps distance, then repeatedly finds the neighbors of each point added to the cluster. As a result, a single cluster can chain together points that are actually quite far from each other.
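This chaining can be seen with a tiny toy example (the 1-D points are invented for illustration): each point is within eps of its neighbor, so DBSCAN links them all into one cluster even though the endpoints are 0.9 apart, far beyond eps.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# points spaced 0.3 apart: each neighbour is within eps=0.35,
# but the endpoints are 0.9 apart
X = np.array([[0.0], [0.3], [0.6], [0.9]])

db = DBSCAN(eps=0.35, min_samples=2).fit(X)
print(db.labels_.tolist())  # [0, 0, 0, 0] -- one chained cluster
```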

To get around this, I chose to use Affinity Propagation instead: https://en.wikipedia.org/wiki/Affinity_propagation
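A minimal sketch of feeding a precomputed matrix to scikit-learn's AffinityPropagation (the toy matrix is invented). One caveat: unlike DBSCAN's precomputed *distances*, AffinityPropagation expects precomputed *similarities* (higher means more alike), so the distances are negated here.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# invented distance matrix: two tight pairs plus one outlier
dist = np.array([
    [0.0, 0.1, 0.9, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1, 0.9],
    [0.9, 0.9, 0.1, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.0],
])

# negate distances to get similarities for affinity='precomputed'
sim = -dist
ap = AffinityPropagation(affinity='precomputed', random_state=0).fit(sim)
print(ap.labels_)
```

Every point is assigned to some exemplar's cluster; unlike DBSCAN, Affinity Propagation has no noise label.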