I need to use a bag of words (in this case a bag of features) to generate descriptor vectors to classify the KTH video dataset. To do this, I need to use the k-means clustering algorithm to cluster the extracted features and find the codebook. The features extracted from the dataset form approximately 75000 vectors of 100 elements each, and I'm running into memory issues with the scipy.cluster.vq.kmeans2 implementation on Ubuntu. I ran some tests and discovered that with 32000 vectors of 100 elements each, the memory used is around 20GB (my total memory is 32GB).
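For reference, this is roughly what I'm doing now (random data here as a stand-in for my real descriptors, and the codebook size k is just a placeholder):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Stand-in for the ~75000 descriptors of 100 elements extracted from the KTH videos
features = np.random.rand(75000, 100).astype(np.float32)

k = 1000  # placeholder codebook size; my actual value may differ
# This is the step where memory usage explodes
codebook, labels = kmeans2(features, k, iter=10, minit='points')
```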
Is there any other Python k-means implementation that is more memory-efficient? I have already read about Mahout for clustering big data, but I still don't understand what its advantages are. Is it more memory-efficient with the amount of data mentioned above?
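One direction I'm considering is scikit-learn's MiniBatchKMeans, which fits the clusters on small batches instead of the whole matrix at once, so peak memory should stay bounded. A minimal sketch of what I have in mind (again, k and batch_size are placeholders, not tuned values):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

features = np.random.rand(75000, 100).astype(np.float32)  # stand-in for my descriptors

k = 1000  # placeholder codebook size
mbk = MiniBatchKMeans(n_clusters=k, batch_size=10000, n_init=3, random_state=0)
mbk.fit(features)  # clusters using mini-batches rather than the full matrix

codebook = mbk.cluster_centers_  # k x 100 codebook
labels = mbk.predict(features)   # assign each descriptor to a visual word
```

Would this be a reasonable substitute for kmeans2 here, or does the mini-batch approximation hurt codebook quality too much for bag-of-features classification?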