6
votes

I've been working with mini-batch k-means, using the scikit-learn implementation, to cluster datasets of about 45,000 observations with about 170 features each. I've noticed that the algorithm has trouble returning the specified number of clusters as k increases: once k exceeds roughly 30% of the number of observations (about 13,500 of 45,000), the number of clusters actually returned stops increasing.
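
Here's a scaled-down sketch of the kind of loop I'm running (variable names are mine, and I've shrunk the data from 45,000×170 so it runs quickly; the same pattern shows up at full size):

    # Stand-in for my real data; 30% of 4500 observations is 1350.
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    rng = np.random.RandomState(0)
    X = rng.rand(4500, 170)

    for k in (500, 1000, 1500, 2000):
        mbk = MiniBatchKMeans(n_clusters=k, batch_size=3000,
                              init="random", random_state=0)
        labels = mbk.fit_predict(X)
        # Count clusters that actually received at least one point.
        print(f"k={k} requested -> {len(np.unique(labels))} returned")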

I was wondering whether this has to do with the way the algorithm is implemented in scikit-learn or with the algorithm's definition itself. I've been studying the paper where it was proposed, but I can't figure out why this would happen.

Has anyone experienced this? Does anyone know how to explain this behavior?

2
Which version of scikit-learn are you using? What is the batch_size? The batch_size should be significantly larger than the number of clusters for the algorithm to work properly. Don't you get any warning message? – ogrisel
I always use a batch_size much larger than k, but I suppose that if k is already very large compared with the dataset size, then batch_size will never be large enough. That might be an explanation. – c_david

2 Answers

4
votes

k-means can fail in the sense that clusters can disappear.

This is most evident when you have a lot of duplicates.

If all your data points are identical, why should there be more than one (non-empty) cluster, ever?

It's not specific to mini-batch k-means as far as I can tell. Some implementations let you specify what to do when a cluster degenerates, e.g. use the farthest point as the new cluster center, discard the cluster, or leave it unchanged (maybe it will pick up a point again).
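
As a quick illustration (a sketch I put together, not from the original post): with heavy duplication, even plain k-means in scikit-learn returns fewer distinct clusters than requested, and warns about it (a ConvergenceWarning mentioning duplicate points).

    # 10 distinct points, each duplicated 100 times, but 50 clusters
    # requested: the surplus centers collapse onto duplicates, so fewer
    # than 50 non-empty clusters come back.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.RandomState(0)
    distinct = rng.rand(10, 2)
    X = np.repeat(distinct, 100, axis=0)  # 1000 rows, only 10 distinct

    km = KMeans(n_clusters=50, n_init=1, random_state=0).fit(X)
    print(len(np.unique(km.labels_)))  # at most 10, not 50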

0
votes

If you are using k-means, you need to tell the algorithm the number of clusters to use; it can't determine that by itself.

Clustering methods like k-means use a distance function under a chosen metric (e.g. Euclidean) to minimize an objective, and in practice they find a local rather than a truly global minimum. The problem you have is really about how to determine the number of clusters, which is a heuristic problem: the objective function keeps decreasing as you add clusters, so simply increasing k won't lead you to the optimal clustering of your dataset. You'll just end up with noisy clusters that aren't really distinct from one another.
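
For instance (an illustrative sketch, not from the original answer), the k-means objective, exposed as inertia_ in scikit-learn, shrinks monotonically as k grows, so minimizing it alone always favors more clusters:

    # Inertia (within-cluster sum of squares) decreases as k increases,
    # even on structureless random data.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.RandomState(0)
    X = rng.rand(500, 5)

    for k in (2, 5, 10, 20, 50):
        km = KMeans(n_clusters=k, n_init=5, random_state=0).fit(X)
        print(f"k={k:3d}  inertia={km.inertia_:.2f}")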

You can refer to Jain, A. K. (2010), "Data clustering: 50 years beyond K-means", Pattern Recognition Letters, 31(8), 651–666, regarding this issue.