
I have a large set of diagnosis code sequences that I am trying to cluster based on similarity. I created a distance matrix by computing the similarity between each pair of sequences using the longest common subsequence (LCS) algorithm, then subtracting that similarity from 1 to get the distance.
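The distance-matrix construction described above can be sketched roughly as follows. This is an assumption about the details: the post does not say how the LCS similarity was normalized to [0, 1], so dividing the LCS length by the longer sequence's length is illustrative, not the author's exact method.

```python
import numpy as np

def lcs_length(a, b):
    # classic dynamic-programming longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def lcs_distance(a, b):
    # assumed normalization: scale similarity to [0, 1] by the longer length
    sim = lcs_length(a, b) / max(len(a), len(b))
    return 1.0 - sim

# toy sequences modelled on the codes in the question
seqs = [['345.3', '345.11'],
        ['345.3', '345.11', '038.9'],
        ['162.3', '038.9']]

n = len(seqs)
dist_mat = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist_mat[i, j] = dist_mat[j, i] = lcs_distance(seqs[i], seqs[j])
```

The resulting symmetric matrix (zero diagonal) is what DBSCAN's `metric='precomputed'` expects.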

I then passed that distance matrix into sklearn's DBSCAN like so:

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.34, metric='precomputed')
db.fit(sim_mat)  # sim_mat holds the 1 - similarity distances

After creating the clusters, I output the sequences contained in each one to a text file. Each of the clusters makes sense to me except for one. For example, this cluster makes sense to me, as each sequence has two of the codes in common and in the same order:

['345.3', '345.11']
['345.3', '345.11', '038.9', '038.0', '276.51']
['345.3', '345.11']
['322.9', '345.3', '345.11']

This cluster, however, (shortened here because it contains 2852 sequences) does not make sense to me, as none of the sequences have any codes in common:

['162.3', '038.9']
['578.1', '584.9']
['416.8', '486', '486', '038.11']
['493.92', '428.0', '584.9', '427.89']
['414.01', '998.59']

My question is whether this is a bug in DBSCAN or whether I am misunderstanding how to use it and/or how it should work. Furthermore, if this is the expected output of the algorithm, is there another one that I should look into using?


2 Answers


By design (the N in DBSCAN stands for Noise), the algorithm also recognizes objects that do not belong to any cluster, referred to as noise.

If you incorrectly treat the noise points as a single cluster, they will of course appear entirely unrelated.

Some samples simply don't fit any cluster, so this is a feature, not a limitation. You could assign each point to the same cluster as its nearest clustered point, but that would not improve the cluster quality.
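In scikit-learn, noise points are given the label -1 in `labels_`, so a 2852-element "cluster" of unrelated sequences is almost certainly the noise set. A minimal sketch with a toy precomputed distance matrix (the values are invented for illustration: two tight pairs plus one outlier):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# invented distance matrix: points 0-1 and 2-3 are close, point 4 is isolated
dist = np.array([
    [0.0, 0.1, 0.9, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1, 0.9],
    [0.9, 0.9, 0.1, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.0],
])

db = DBSCAN(eps=0.34, min_samples=2, metric='precomputed').fit(dist)
print(db.labels_.tolist())  # [0, 0, 1, 1, -1] -- the isolated point is noise

# keep only the points that were actually assigned to a cluster
clustered = db.labels_ != -1
```

Filtering on `labels_ != -1` before writing the clusters to a file would keep the noise points out of the output.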


I figured it out. Based on the description of DBSCAN (https://en.wikipedia.org/wiki/DBSCAN), this behavior is normal. The algorithm starts with one point, finds its neighbors within the eps distance, then repeatedly finds the neighbors of each point added to the cluster. As a result, a single cluster can chain together points that are actually quite far from each other.
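This chaining can be seen with a tiny toy example (the 1-D points are invented for illustration): each point is within eps of its neighbor, so DBSCAN links them all into one cluster even though the endpoints are 0.9 apart, far beyond eps.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# points spaced 0.3 apart: each neighbour is within eps=0.35,
# but the endpoints are 0.9 apart
X = np.array([[0.0], [0.3], [0.6], [0.9]])

db = DBSCAN(eps=0.35, min_samples=2).fit(X)
print(db.labels_.tolist())  # [0, 0, 0, 0] -- one chained cluster
```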

To get around this, I chose to use Affinity Propagation instead: https://en.wikipedia.org/wiki/Affinity_propagation
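A minimal sketch of feeding a precomputed matrix to scikit-learn's AffinityPropagation (the toy matrix is invented). One caveat: unlike DBSCAN's precomputed *distances*, AffinityPropagation expects precomputed *similarities* (higher means more alike), so the distances are negated here.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# invented distance matrix: two tight pairs plus one outlier
dist = np.array([
    [0.0, 0.1, 0.9, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1, 0.9],
    [0.9, 0.9, 0.1, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.0],
])

# negate distances to get similarities for affinity='precomputed'
sim = -dist
ap = AffinityPropagation(affinity='precomputed', random_state=0).fit(sim)
print(ap.labels_)
```

Every point is assigned to some exemplar's cluster; unlike DBSCAN, Affinity Propagation has no noise label.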