2
votes

Please forgive the lack of formal terms, I've only recently approached ML.

For learning purposes, I decided to try a Ruby implementation of the DBSCAN algorithm (https://github.com/matiasinsaurralde/dbscan).

Building on the simple example at https://github.com/matiasinsaurralde/dbscan/blob/master/examples/simple.rb I created an array of 1000 arrays, each holding two random values, "x" and "y" (a 2d point) and I've then fed this data to the DBSCAN algorithm (tweaking "epsilon" and "minimum distance" as needed).

data_sample = Array.new(1000) { Array.new(2) { rand(100).round } }

dbscan = DBSCAN( data_sample, :epsilon => 3, :min_points => 2, :distance => :euclidean_distance )

and then I've then exported the resulting data (clusters and unclustered data) to Excel, to draw a graph of clusters and unclustered data.

This is what I came up with:

enter image description here Black dots is unclustered data.

Now there is one thing I am unsure about: for some points very close to each other, or points who share the same exact x and y, I am seeing that one of the two points, instead of making it into the sane cluster of the other point, is categorized as unclustered.

Take a look at point 47, 74: the point belonging to a cluster is "above" another unclustered point. This also happens at 14, 87, at 77,64, at 20,61 and in many other places (for some couple of points, they have the same x and y).

Now as I said I'm still approaching to this, so could anyone please tell me if there is an explanation for what I am seeing? Does it have to do with the inner workings of the DBSCAN algorithm? Or is it more likely that there are some errors in the implementation of the algorithm? Or some wrong assumption I am making?

I hope this is all you need to know but if you need more please just ask.

2
Compare to known-good implementations such as the one in ELKI.Has QUIT--Anony-Mousse

2 Answers

2
votes

So there are actually two questions inside:

  1. Is it possible that two very close points have distinct labeling? In particular one is "unclustered"?

Yes, it is possible and comes directly from the dbscan method, which in particular requires a point to have a given amount of close neighbours, in order to be classified as anything other than "unclustered"

  1. is it possible that two points of exactly the same position end up in two different clusters?

No, it is not possible. Thus either these points are not identical (maybe you are comparing their rounded representation, and not the true ones?), or implementation has an error.

0
votes

Some rather obvious properties of DBSCAN labeled points:

  1. Neighbors of a core point must have the same label as the core point (with a rare exception for border points that have previously been assigned to a different cluster)
  2. Neighbors of a non-core point can have different labels, but at least one must be a core point of the same label
  3. Neighbors of a noise point must not be core points

Number 2 means that non-core points can be arbitrary close and have different labels.

Technically it even is possible that two points at the same coordinate have different labels, if the points are both border points. But that will probably only happen in a parallel variant, I'm not aware of one where this actually would happen. The usual DBSCAN algorithm would assign them the same label (on first discovery).

Border points cause weird anomalies, which why the authors removed them in HDBSCAN*.