34
votes

I've been searching for an answer for this question for quite a while, so I'm hoping someone can help me. I'm using dbscan from the fpc library in R. For example, I am looking at the USArrests data set and am using dbscan on it as follows:

library(fpc)
ds <- dbscan(USArrests,eps=20)

Choosing eps was merely by trial and error in this case. However I am wondering if there is a function or code available to automate the choice of the best eps/minpts. I know some books recommend producing a plot of the kth sorted distance to its nearest neighbour. That is, the x-axis represents "Points sorted according to distance to kth nearest neighbour" and the y-axis represents the "kth nearest neighbour distance".

This type of plot is useful for helping choose an appropriate value for eps and minpts. I hope I have provided enough information for someone to be help me out. I wanted to post a pic of what I meant however I'm still a newbie so can't post an image just yet.

6

6 Answers

30
votes

There is no general way of choosing minPts. It depends on what you want to find. A low minPts means it will build more clusters from noise, so don't choose it too small.

For epsilon, there are various aspects. It again boils down to choosing whatever works on this data set and this minPts and this distance function and this normalization. You can try to do a knn distance histogram and choose a "knee" there, but there might be no visible one, or multiple.

OPTICS is a successor to DBSCAN that does not need the epsilon parameter (except for performance reasons with index support, see Wikipedia). It's much nicer, but I believe it is a pain to implement in R, because it needs advanced data structures (ideally, a data index tree for acceleration and an updatable heap for the priority queue), and R is all about matrix operations.

Naively, one can imagine OPTICS as doing all values of Epsilon at the same time, and putting the results in a cluster hierarchy.

The first thing you need to check however - pretty much independent of whatever clustering algorithm you are going to use - is to make sure you have a useful distance function and appropriate data normalization. If your distance degenerates, no clustering algorithm will work.

15
votes

One common and popular way of managing the epsilon parameter of DBSCAN is to compute a k-distance plot of your dataset. Basically, you compute the k-nearest neighbors (k-NN) for each data point to understand what is the density distribution of your data, for different k. the KNN is handy because it is a non-parametric method. Once you choose a minPTS (which strongly depends on your data), you fix k to that value. Then you use as epsilon the k-distance corresponding to the area of the k-distance plot (for your fixed k) with a low slope.

15
votes

MinPts

As Anony-Mousse explained, 'A low minPts means it will build more clusters from noise, so don't choose it too small.'.

minPts is best set by a domain expert who understands the data well. Unfortunately many cases we don't know the domain knowledge, especially after data is normalized. One heuristic approach is use ln(n), where n is the total number of points to be clustered.

epsilon

There are several ways to determine it:

1) k-distance plot

In a clustering with minPts = k, we expect that core pints and border points' k-distance are within a certain range, while noise points can have much greater k-distance, thus we can observe a knee point in the k-distance plot. However, sometimes there may be no obvious knee, or there can be multiple knees, which makes it hard to decide

2) DBSCAN extensions like OPTICS

OPTICS produce hierarchical clusters, we can extract significant flat clusters from the hierarchical clusters by visual inspection, OPTICS implementation is available in Python module pyclustering. One of the original author of DBSCAN and OPTICS also proposed an automatic way to extract flat clusters, where no human intervention is required, for more information you can read this paper.

3) sensitivity analysis

Basically we want to chose a radius that is able to cluster more truly regular points (points that are similar to other points), while at the same time detect out more noise (outlier points). We can draw a percentage of regular points (points belong to a cluster) VS. epsilon analysis, where we set different epsilon values as the x-axis, and their corresponding percentage of regular points as the y axis, and hopefully we can spot a segment where the percentage of regular points value is more sensitive to the epsilon value, and we choose the upper bound epsilon value as our optimal parameter.

11
votes

For details on choosing parameters, see the paper below on p. 11:

Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), 19.

  • For two-dimensional data: use default value of minPts=4 (Ester et al., 1996)
  • For more than 2 dimensions: minPts=2*dim (Sander et al., 1998)

Once you know which MinPts to choose, you can determine Epsilon:

  • Plot the k-distances with k=minPts (Ester et al., 1996)
  • Find the 'elbow' in the graph--> The k-distance value is your Epsilon value.
0
votes

If you have the resources, you can also test a bunch of epsilon and minPts values and see what works. I do this using expand.grid and mapply.

# Establish search parameters.
k <- c(25, 50, 100, 200, 500, 1000)
eps <- c(0.001, 0.01, 0.02, 0.05, 0.1, 0.2)

# Perform grid search.
grid <- expand.grid(k = k, eps = eps)

results <- mapply(grid$k, grid$eps, FUN = function(k, eps) {
  cluster <- dbscan(data, minPts = k, eps = eps)$cluster
  sum <- table(cluster)
  cat(c("k =", k, "; eps =", eps, ";", sum, "\n"))
})