3
votes

I have the following points in 3D space:

enter image description here

I need to group the points, according to D_max and d_max:

D_max = max dimension of each group
d_max = max distance of points inside each group

Like this:

enter image description here

The shape of the group in the above image looks like a box, but the shape can be anything which would be the output of the grouping algorithm.


I'm using Python and visualize the results with Blender. I'm considering using the scipy.spatial.KDTree and calling its query API, however, I'm not sure if that's the right tool for the job at hand. I'm worried that there might be a better tool which I'm not aware of. I'm curious to know if there is any other tool/library/algorithm which can help me.


As @CoMartel pointed out, there is DBSCAN and also HDBSCAN clustering modules which look like a good fit for this type of problems. However, as pointed out by @Paul they lack the option for max size of the cluster which correlates to my D_max parameter. I'm not sure how to add a max cluster size feature to DBSCAN and HDBSCAN clustering.


Thanks to @Anony-Mousse I watched Agglomerative Clustering: how it works and Hierarchical Clustering 3: single-link vs. complete-link and I'm studying Comparing Python Clustering Algorithms, I feel like it's getting more clear how these algorithms work.

3
It looks like a DBScan could do a better job : scikit-learn.org/stable/modules/generated/… - CoMartel
There is also a derivative called HDBScan, with some differences that might be good for you - CoMartel
DBScan only considers d_max but not D_max (to stay in the parlance of this post). Also, I am not sure how well DBScan will work anyway, as most of the points seem to be on a grid, and hence most nearest neighbors will be equally far away. - Paul Brodersen
If you do apply DBSCAN first, what is the maximum distance of points within a cluster (let's call it D)? How does that compare to your target D_max? You may find that that currently, D >> D_max, in which case you probably would just want to pre-cluster your points with DBSCAN, and then partition those clusters using some other algorithm to enforce your D_max criterion (which to me really seems more of a packaging problem than a clustering problem). - Paul Brodersen
@CoMartel Can you please post your comments on DBSCAN and HDBSCAN as an answer so that I can accept it and close the question as answered. Your suggestion of those two algorithms actually solved my problem. - user3405291

3 Answers

2
votes

As requested, my comment as an answer :

You could use DBSCAN(http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) or HDBSCAN.

Both these algorithm allow to group each point according to d_max (maximum distance between 2 points of the same dataset), but they don't take the maximum cluster size. The only way to limit the maximum size of a cluster is by reducing the epsparameter, which control the max distance between 2 points of the same cluster.

2
votes

Use hierarchical agglomerative clustering.

If you use complete linkage you can control the maximum diameter of the clusters. The complete link is the maximum distance.

DBSCAN's epsilon parameter is not a maximum distance because multiple steps are joined transitively. Clusters can become much larger than epsilon!

0
votes

DBSCAN clustering algorithm with the maximum distance of points inside each group extension

You can use the DBSCAN algorithm recursively.

def DBSCAN_with_max_size(myData, eps = E, max_size = S):
     clusters = DBSCAN(myData, eps = E)
     Big_Clusters = find_big_clusters(clusters)
     for big_cluster in Big_Clusters:
         DBSCAN_with_max_size(big_cluster ,eps = E/2 ,max_size = S) //eps is something lower than E (e.g. E/2)