6
votes

I've been working with mini-batch k-means, using the scikit-learn implementation, to cluster datasets of about 45,000 observations with about 170 features each. I've noticed that the algorithm has trouble returning the specified number of clusters as k increases: once k exceeds roughly 30% of the number of observations (about 13,500 of 45,000), the number of clusters actually returned stops increasing.
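
Here's a scaled-down sketch of the kind of loop I'm running (variable names are mine, and I've shrunk the data from 45,000×170 so it runs quickly; the same pattern shows up at full size):

    # Stand-in for my real data; 30% of 4500 observations is 1350.
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    rng = np.random.RandomState(0)
    X = rng.rand(4500, 170)

    for k in (500, 1000, 1500, 2000):
        mbk = MiniBatchKMeans(n_clusters=k, batch_size=3000,
                              init="random", random_state=0)
        labels = mbk.fit_predict(X)
        # Count clusters that actually received at least one point.
        print(f"k={k} requested -> {len(np.unique(labels))} returned")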

I was wondering whether this has to do with the way the algorithm is implemented in scikit-learn or with the algorithm's definition itself. I've been studying the paper where it was proposed, but I can't figure out why this would happen.

Has anyone experienced this? Does anyone know how to explain this behavior?

2
Which version of scikit-learn are you using? What is the batch_size? The batch_size should be significantly larger than the number of clusters for the algorithm to work properly. Don't you get any warning message? – ogrisel
I always use a batch_size much larger than k, but I suppose that if k is already very large compared with the dataset size, then batch_size will never be large enough. That might be an explanation. – c_david

2 Answers

4
votes

k-means can fail in the sense that clusters can disappear.

This is most evident when you have a lot of duplicates.

If all your data points are identical, why should there be more than one (non-empty) cluster, ever?

It's not specific to mini-batch k-means as far as I can tell. Some implementations let you specify what to do when a cluster degenerates, e.g. use the farthest point as the new cluster center, discard the cluster, or leave it unchanged (maybe it will pick up a point again).
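
As a quick illustration (a sketch I put together, not from the original post): with heavy duplication, even plain k-means in scikit-learn returns fewer distinct clusters than requested, and warns about it (a ConvergenceWarning mentioning duplicate points).

    # 10 distinct points, each duplicated 100 times, but 50 clusters
    # requested: the surplus centers collapse onto duplicates, so fewer
    # than 50 non-empty clusters come back.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.RandomState(0)
    distinct = rng.rand(10, 2)
    X = np.repeat(distinct, 100, axis=0)  # 1000 rows, only 10 distinct

    km = KMeans(n_clusters=50, n_init=1, random_state=0).fit(X)
    print(len(np.unique(km.labels_)))  # at most 10, not 50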

0
votes

If you are using k-means, you need to tell the algorithm the number of clusters to use; it can't determine that by itself.

Clustering methods like k-means use a distance function under a chosen metric (e.g. Euclidean) to minimize an objective, and in practice they find a local rather than a truly global minimum. The problem you have is really about how to determine the number of clusters, which is a heuristic problem: the objective function keeps decreasing as you add clusters, so simply increasing k won't lead you to the optimal clustering of your dataset. You'll just end up with noisy clusters that aren't really distinct from one another.
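
For instance (an illustrative sketch, not from the original answer), the k-means objective, exposed as inertia_ in scikit-learn, shrinks monotonically as k grows, so minimizing it alone always favors more clusters:

    # Inertia (within-cluster sum of squares) decreases as k increases,
    # even on structureless random data.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.RandomState(0)
    X = rng.rand(500, 5)

    for k in (2, 5, 10, 20, 50):
        km = KMeans(n_clusters=k, n_init=5, random_state=0).fit(X)
        print(f"k={k:3d}  inertia={km.inertia_:.2f}")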

You can refer to Jain, A. K. (2010), "Data clustering: 50 years beyond K-means", Pattern Recognition Letters, 31(8), 651–666, regarding this issue.