Is this the expected behavior of the DBSCAN algorithm (two identical data samples not fitting in the same cluster)?

Question

Please forgive the lack of formal terms, I've only recently approached ML.

For learning purposes, I decided to try a Ruby implementation of the DBSCAN algorithm (https://github.com/matiasinsaurralde/dbscan).

Building on the simple example at https://github.com/matiasinsaurralde/dbscan/blob/master/examples/simple.rb I created an array of 1000 arrays, each holding two random values, "x" and "y" (a 2d point) and I've then fed this data to the DBSCAN algorithm (tweaking "epsilon" and "minimum distance" as needed).

data_sample = Array.new(1000) { Array.new(2) { rand(100).round } }

dbscan = DBSCAN( data_sample, :epsilon => 3, :min_points => 2, :distance => :euclidean_distance )

and then I've then exported the resulting data (clusters and unclustered data) to Excel, to draw a graph of clusters and unclustered data.

This is what I came up with:

Black dots is unclustered data.

Now there is one thing I am unsure about: for some points very close to each other, or points who share the same exact x and y, I am seeing that one of the two points, instead of making it into the sane cluster of the other point, is categorized as unclustered.

Take a look at point 47, 74: the point belonging to a cluster is "above" another unclustered point. This also happens at 14, 87, at 77,64, at 20,61 and in many other places (for some couple of points, they have the same x and y).

Now as I said I'm still approaching to this, so could anyone please tell me if there is an explanation for what I am seeing? Does it have to do with the inner workings of the DBSCAN algorithm? Or is it more likely that there are some errors in the implementation of the algorithm? Or some wrong assumption I am making?

I hope this is all you need to know but if you need more please just ask.

Compare to known-good implementations such as the one in ELKI. — Has QUIT--Anony-Mousse

lejlot lejlot · Accepted Answer · 2016-05-08T23:19:58

So there are actually two questions inside:

Is it possible that two very close points have distinct labeling? In particular one is "unclustered"?

Yes, it is possible and comes directly from the dbscan method, which in particular requires a point to have a given amount of close neighbours, in order to be classified as anything other than "unclustered"

is it possible that two points of exactly the same position end up in two different clusters?

No, it is not possible. Thus either these points are not identical (maybe you are comparing their rounded representation, and not the true ones?), or implementation has an error.

Is this the expected behavior of the DBSCAN algorithm (two identical data samples not fitting in the same cluster)?

2 Answers