2
votes

I have 4-dimensional data which needs to be clustered to build minimum-volume bounding ellipsoids for each cluster. I don't want any single-point clusters, or at least as few of them as possible, because we can't build an ellipsoidal confidence region from a single point. In my problem the number of clusters is not given in advance, so I am using scikit-learn's Affinity Propagation (http://scikit-learn.org/stable/modules/clustering.html#affinity-propagation) to estimate the number of clusters and perform the clustering. But this approach gives me a lot of single-point clusters. Can you give me some insight on how to solve this problem?

P.S.: To give you even more information, I am working on ellipsoidal nested sampling for Bayesian evidence calculation.
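For reference, a minimal sketch of the setup described above, assuming the data is in an array X of shape (n_samples, 4) (the name X is a placeholder). The preference parameter is scikit-learn's own knob for how many exemplars get chosen; lowering it tends to produce fewer, larger clusters and hence fewer singletons:

    import numpy as np
    from sklearn.cluster import AffinityPropagation

    # X: (n_samples, 4) data array (assumed name)
    af = AffinityPropagation().fit(X)  # default preference = median similarity
    labels = af.labels_
    n_clusters = len(af.cluster_centers_indices_)
    # Lowering `preference` (e.g. AffinityPropagation(preference=-50)) makes
    # fewer points exemplars, which usually means fewer single-point clusters.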

2
Maybe remove outliers first? - Has QUIT--Anony-Mousse
Removing the outliers would certainly help. You might also want to consider using a Gaussian mixture model; the number of component Gaussians can be selected with the AIC or BIC criterion. - hrs
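A minimal sketch of that suggestion, selecting the number of Gaussian components by BIC with scikit-learn's GaussianMixture (the function name and the component range are placeholders):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_gmm_bic(X, max_components=10):
        # Fit GMMs with 1..max_components components and keep the lowest-BIC
        # model (swap gmm.bic(X) for gmm.aic(X) to use the AIC criterion).
        best, best_bic = None, np.inf
        for k in range(1, max_components + 1):
            gmm = GaussianMixture(n_components=k).fit(X)
            bic = gmm.bic(X)
            if bic < best_bic:
                best, best_bic = gmm, bic
        return best

    # labels = fit_gmm_bic(X).predict(X)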

2 Answers

1
vote

I don't know whether you insist on Affinity Propagation, but with DBSCAN you can achieve what you want through its two algorithm parameters, eps and minPts.

A greater eps means that clusters of lower density can be detected, and nearby clusters will also be merged.

A greater minPts means that more points will be marked as noise.
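A minimal sketch with scikit-learn, where minPts is called min_samples; the eps and min_samples values below are just the library defaults and need tuning to the data's scale:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # X: (n_samples, 4) data array (assumed name)
    db = DBSCAN(eps=0.5, min_samples=5).fit(X)  # tune eps and min_samples
    labels = db.labels_  # label -1 marks noise points instead of singleton clusters
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)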

0
votes

Affinity Propagation doesn't have a concept of ellipsoids, so I am not sure it does what you want.

MultiNest solves this with a variant of X-means clustering. Basically, you first build a wrapping ellipsoid covering all points. Then you identify the most important axis, place two k-means centres at its ends, and run k-means to convergence. Now you can decide whether the two ellipsoids use much less volume than the single ellipsoid; if so, the process is repeated recursively. nestle is an open-source implementation with quite clear code: https://github.com/kbarbary/nestle/
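A simplified sketch of that recursive splitting, not MultiNest's or nestle's actual code: the sample-covariance enclosing ellipsoid is a crude stand-in for a proper minimum-volume fit, and min_points and the gain threshold are made-up parameters:

    import numpy as np
    from sklearn.cluster import KMeans

    def enclosing_ellipsoid(points):
        # Crude enclosing ellipsoid: scale the sample covariance so that the
        # largest Mahalanobis distance of any point from the centre equals 1.
        center = points.mean(axis=0)
        cov = np.cov(points, rowvar=False)
        inv = np.linalg.inv(cov)
        d = np.einsum('ij,jk,ik->i', points - center, inv, points - center)
        return center, cov * d.max()

    def ellipsoid_volume(cov):
        # Proportional to the true volume; constants cancel in comparisons.
        return np.sqrt(np.linalg.det(cov))

    def split_recursively(points, min_points=5, gain=0.5):
        # Split with 2-means seeded at the ends of the principal axis, and
        # keep the split only if it saves a lot of volume.
        center, cov = enclosing_ellipsoid(points)
        if len(points) < 2 * min_points:
            return [(center, cov)]
        w, v = np.linalg.eigh(cov)
        axis = v[:, -1] * np.sqrt(w[-1])  # longest principal axis
        init = np.vstack([center + axis, center - axis])
        labels = KMeans(n_clusters=2, init=init, n_init=1).fit_predict(points)
        parts = [points[labels == k] for k in (0, 1)]
        if min(len(p) for p in parts) < min_points:
            return [(center, cov)]
        subs = [enclosing_ellipsoid(p) for p in parts]
        if sum(ellipsoid_volume(c) for _, c in subs) < gain * ellipsoid_volume(cov):
            return [e for p in parts
                    for e in split_recursively(p, min_points, gain)]
        return [(center, cov)]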

Alternatively, you can also put an ellipsoid around every point and determine a radius using K-fold cross-validation or bootstrapping. This is what MLFriends uses; implementation, explanation and animation are at https://johannesbuchner.github.io/UltraNest/method.html. Here the ellipsoid covariance can first be chosen from the sample covariance, and once clusters are detected, they can be co-centred and a common sample covariance determined. For determining the radius, you leave out some test points and make sure the ellipsoids around the training points are big enough to recover the test points. This is quite stable and parameter-free.
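A sketch of that K-fold radius selection, not UltraNest's implementation; the function name is made up, and a single shared covariance estimated from all points is assumed:

    import numpy as np

    def mlfriends_radius(points, k=5, seed=None):
        # Choose the Mahalanobis radius around the training points so that
        # every held-out test point is recovered, over k folds.
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(points))
        folds = np.array_split(idx, k)
        inv_cov = np.linalg.inv(np.cov(points, rowvar=False))
        radius = 0.0
        for f in folds:
            test = points[f]
            train = points[np.setdiff1d(idx, f)]
            diff = test[:, None, :] - train[None, :, :]
            # squared Mahalanobis distance from each test point to each training point
            d2 = np.einsum('ijk,kl,ijl->ij', diff, inv_cov, diff)
            radius = max(radius, np.sqrt(d2.min(axis=1)).max())
        return radius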

DBSCAN could also be interesting to explore. However, in the end you need hard, not fuzzy, proposal surfaces for nested sampling.