0
votes

I am using Sander et al. (1998) to determine MinPts and epsilon for running DBSCAN on my dataset. As Sander et al. suggest, minpts=dim*2-1=k (in my case 9 dimensions --> minpts=k=17). According to the paper, one should choose the "first valley" in the sorted k-distance plot. I can see two valleys, but which one is the first? And what value would you choose for epsilon?

[Image: kdistplot_with_duplicates]

Since Sander et al. also suggest that this method should only be used if there are no duplicates, here is the plot without them (though I think it should not matter in this case):

[Image: kdistplot_without_duplicates]

Which valley should be considered the "first" one?

Code used:

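import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

ns = 17  # k = 2*dim - 1 for 9 dimensions
nbrs = NearestNeighbors(n_neighbors=ns, metric='euclidean').fit(data)
# Querying with the training data counts each point among its own ns neighbors
distances, indices = nbrs.kneighbors(data)
distanceDec = sorted(distances[:, ns-1], reverse=True)  # k-distances, descending
plt.plot(range(1, len(distanceDec) + 1), distanceDec)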

2 Answers

0
votes

This indicates there may be hierarchies of clusters, or clusters with varying density.

In such cases, a single DBSCAN threshold will not be enough. You can try clustering twice, with two different thresholds, or use a hierarchical variant such as OPTICS or HDBSCAN. Recently, people have been quite happy with HDBSCAN, but I have had better results with OPTICS (and I believe there is a good reason for that, namely that I want border points to be part of the cluster).
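
To make the "two thresholds" idea concrete, here is a minimal sketch using scikit-learn's OPTICS: the ordering is computed once, and DBSCAN-style labelings are then extracted at two different epsilon values with `cluster_optics_dbscan`. The synthetic `data`, `min_samples=10`, and the two epsilon values are assumptions for illustration only, not values derived from the plots in the question.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

rng = np.random.default_rng(0)
# Synthetic stand-in for the real data: one tight blob and one loose blob
dense = rng.normal(loc=0.0, scale=0.2, size=(100, 2))
sparse = rng.normal(loc=8.0, scale=1.5, size=(100, 2))
data = np.vstack([dense, sparse])

# Compute the OPTICS ordering and reachability distances once
optics = OPTICS(min_samples=10).fit(data)

def labels_at(eps):
    """Extract a DBSCAN-like labeling at the given eps (-1 marks noise)."""
    return cluster_optics_dbscan(
        reachability=optics.reachability_,
        core_distances=optics.core_distances_,
        ordering=optics.ordering_,
        eps=eps,
    )

labels_small = labels_at(0.3)  # small eps: only the tight blob forms a cluster
labels_large = labels_at(2.0)  # larger eps: the loose blob is picked up as well

for name, labels in [("eps=0.3", labels_small), ("eps=2.0", labels_large)]:
    print(name, "clustered points:", int((labels >= 0).sum()))
```

With real data, the two epsilon values would be the k-distances at the two valleys of the plot.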

0
votes

It is the valley on the left (the smallest values of epsilon). With this value as the threshold, all the points to its left are unclustered (considered as noise) and all the points to its right are clustered.

You can read the original DBSCAN paper and in particular see Figure 4 to better understand the rationale.
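
As a rough illustration of that rationale, the sketch below compares the k-distance of each point with a chosen threshold and checks it against the core points reported by scikit-learn's DBSCAN. The synthetic `data`, `k = 5`, and `eps = 0.4` are assumptions for illustration only. Note that in this implementation a point whose k-distance exceeds eps is never a core point, but it can still be attached to a cluster as a border point, so "left of the threshold means noise" holds only approximately.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Synthetic stand-in: a dense blob plus uniform background noise
blob = rng.normal(loc=0.0, scale=0.3, size=(150, 2))
background = rng.uniform(-5.0, 5.0, size=(50, 2))
data = np.vstack([blob, background])

k = 5      # plays the role of MinPts (the point itself is counted)
eps = 0.4  # hypothetical threshold read off the k-distance plot

# k-distance of every point, computed the same way as in the question
dist, _ = NearestNeighbors(n_neighbors=k).fit(data).kneighbors(data)
k_dist = dist[:, k - 1]

db = DBSCAN(eps=eps, min_samples=k).fit(data)
is_core = np.zeros(len(data), dtype=bool)
is_core[db.core_sample_indices_] = True

# Points "left of the threshold" in the descending plot have k_dist > eps
left_of_threshold = k_dist > eps
print("core points:", int(is_core.sum()))
print("left of threshold:", int(left_of_threshold.sum()))
print("of those, labeled as noise:", int((db.labels_[left_of_threshold] == -1).sum()))
```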