DBSCAN kdist-Plot multiple valleys

Question

I am using Sander et al. (1998) to determine MinPts and epsilon for DBSCAN on my dataset. As Sander et al. suggest, minpts = dim*2 - 1 = k (in my case 9 dimensions --> minpts = k = 17). According to the paper, one should choose the "first valley". I can see two valleys, but which one is the first? And what value would you choose for epsilon? [Plot: kdistplot_with_duplicates]

Since Sander et al. also suggest that this method should only be used if there are no duplicates, here is one without them (though I think it should not matter in this case): [Plot: kdistplot_without_duplicates]. Which valley should be considered the "first" one?

Code used:

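import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

ns = 17  # k = 2*dim - 1 for 9 dimensions
nbrs = NearestNeighbors(n_neighbors=ns, metric='euclidean').fit(data)
# Querying the fitted data itself returns each point as its own first neighbor (distance 0)
distances, indices = nbrs.kneighbors(data)
distanceDec = sorted(distances[:, ns-1], reverse=True)
# Use the actual number of points rather than a hard-coded 683
plt.plot(range(1, len(distanceDec) + 1), distanceDec)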

Has QUIT--Anony-Mousse · Accepted Answer · 2019-11-16T17:19:31

This indicates there may be hierarchies of clusters, or clusters with varying density.

In such cases, a single DBSCAN threshold will not be enough. You can try clustering twice, with two different thresholds, or use a hierarchical version such as OPTICS or HDBSCAN. Recently, people have been quite happy with HDBSCAN; I have had better results with OPTICS (and I believe there is a good reason why, namely that I want border points to be part of the cluster).
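
As a rough illustration of the "two thresholds" idea, here is a minimal sketch using scikit-learn's OPTICS. It computes the reachability ordering once and then extracts DBSCAN-style clusterings at two epsilon values. The synthetic data (one dense and one sparse blob in 9 dimensions) and the two epsilon values are assumptions made up for this example. They are not taken from the question's dataset, where the two candidates would be read off the two valleys in the k-dist plot.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_blobs

# Hypothetical stand-in data: one dense and one sparse blob in 9 dimensions.
dense, _ = make_blobs(n_samples=300, centers=[[0.0] * 9],
                      cluster_std=0.3, random_state=0)
sparse, _ = make_blobs(n_samples=300, centers=[[10.0] * 9],
                       cluster_std=1.5, random_state=0)
X = np.vstack([dense, sparse])

min_pts = 17  # same value as in the question

# Fit once; the reachability information can be cut at several thresholds.
optics = OPTICS(min_samples=min_pts, metric="euclidean").fit(X)


def dbscan_cut(eps):
    """Extract a DBSCAN-like labeling from the OPTICS result at a given eps."""
    return cluster_optics_dbscan(
        reachability=optics.reachability_,
        core_distances=optics.core_distances_,
        ordering=optics.ordering_,
        eps=eps,
    )


# Assumed thresholds for the synthetic data, one per "valley".
labels_tight = dbscan_cut(1.0)
labels_loose = dbscan_cut(5.0)

for name, labels in [("tight", labels_tight), ("loose", labels_loose)]:
    n_clusters = len(set(labels) - {-1})
    n_noise = int(np.sum(labels == -1))
    print(f"{name}: {n_clusters} clusters, {n_noise} noise points")
```

The smaller threshold tends to keep only the dense structure and mark the rest as noise, while the larger one also picks up sparser clusters. Comparing the two labelings shows whether the valleys correspond to nested or differently dense clusters.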
