I just thought about some special cases for DBSCAN. The case is illustrated here. Assume eps equals to the radius of the circles. For MinPts=3 p and r are corepoints. It is unclear wether q belongs to the cluster of p or of r. If a recursive implementation is used and the algorithm checks r first, q would be part of the cluster of r. Hence p would define a cluster with just two elements. The original paper states: "Note that a cluster wrt. Eps and MinPts contains at least MinPts points [...]" Am I missing something here or was this special case just not considered?
1 Answers
In example, q is also a core point: there are three points in the circle: p, q, r. You need minPts=4 in this example.
You need to distinguish the theoretical definition of a density cluster from the efficient algorithm output, which only "almost" gives the theoretical result for a good reason: In the theoretical model, q would be part of both clusters. But that is inconvenient and surprising to users.
You are not the first to notice. Even Wikipedia knows this:
While minPts intuitively is the minimum cluster size, in some cases DBSCAN can produce smaller clusters.[5] A DBSCAN cluster consists of at least one core point.[5] As other points may be border points to more than one cluster, there is no guarantee that at least minPts points are included in every cluster.
Reference [5] is the article
Schubert, Erich; Sander, Jörg; Ester, Martin; Kriegel, Hans Peter; Xu, Xiaowei (July 2017). "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN". ACM Trans. Database Syst. 42 (3): 19:1–19:21. doi:10.1145/3068335. ISSN 0362-5915.
Which contains the footnote:
Note that this can, in rare cases, lead to a cluster with fewer than minPts points, if too many border points are reachable by different clusters, and have previously been assigned to other clusters. Every cluster will at least have one core point. Multi-assignment to exactly represent the theoretical model—or an assignment by shortest distance—can be implemented easily.