I need to use a bag of words (in this case a bag of features) to generate descriptor vectors to classify the KTH video dataset. To do this, I need to use the k-means clustering algorithm to cluster the extracted features and find the codebook. The features extracted from the dataset form approximately 75000 vectors of 100 elements each, and I'm running into memory issues with the scipy.cluster.vq.kmeans2 implementation on Ubuntu. I ran some tests and discovered that with 32000 vectors of 100 elements each, the memory used is around 20GB (my total memory is 32GB).
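For reference, this is roughly what I'm doing now (random data here as a stand-in for my real descriptors, and the codebook size k is just a placeholder):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Stand-in for the ~75000 descriptors of 100 elements extracted from the KTH videos
features = np.random.rand(75000, 100).astype(np.float32)

k = 1000  # placeholder codebook size; my actual value may differ
# This is the step where memory usage explodes
codebook, labels = kmeans2(features, k, iter=10, minit='points')
```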
Is there any other Python k-means implementation that is more memory-efficient? I have already read about Mahout for clustering big data, but I still don't understand what its advantages are. Is it more memory-efficient with the amount of data mentioned above?
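One direction I'm considering is scikit-learn's MiniBatchKMeans, which fits the clusters on small batches instead of the whole matrix at once, so peak memory should stay bounded. A minimal sketch of what I have in mind (again, k and batch_size are placeholders, not tuned values):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

features = np.random.rand(75000, 100).astype(np.float32)  # stand-in for my descriptors

k = 1000  # placeholder codebook size
mbk = MiniBatchKMeans(n_clusters=k, batch_size=10000, n_init=3, random_state=0)
mbk.fit(features)  # clusters using mini-batches rather than the full matrix

codebook = mbk.cluster_centers_  # k x 100 codebook
labels = mbk.predict(features)   # assign each descriptor to a visual word
```

Would this be a reasonable substitute for kmeans2 here, or does the mini-batch approximation hurt codebook quality too much for bag-of-features classification?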