
I'm trying to implement k-means for text clustering, specifically English sentences. So far I'm at the point where I have a term-frequency matrix for each document (sentence). I'm a little confused about the actual implementation of k-means on text data. Here's my guess of how it should work.

  1. Figure out the number of unique words in all sentences (a large number, call it n).

  2. Create k n-dimensional vectors (cluster centroids) and fill them with random numbers (how do I decide what the bounds for these numbers should be?)

  3. Determine the Euclidean distance from each of the q sentences to the k random centroids, reassign sentences, reposition the centroids, etc., as sketched below. (If n is very large, as it is for English, wouldn't calculating the Euclidean distance between these vectors be very costly?)
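
Here's a rough sketch of the loop I'm imagining (plain NumPy; `tf` is a placeholder for my q x n term-frequency matrix, and rather than picking arbitrary random numbers I initialize the centroids from randomly chosen sentences, since I don't know what bounds to use):

```python
import numpy as np

def kmeans_tf(tf, k, n_iters=100, seed=0):
    """Naive k-means on a q x n term-frequency matrix (my steps 2 and 3)."""
    rng = np.random.default_rng(seed)
    q, n = tf.shape
    # Step 2: initialize centroids from random sentences so they lie in the data's range.
    centroids = tf[rng.choice(q, size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Step 3: Euclidean distance from every sentence to every centroid.
        dists = np.linalg.norm(tf[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Reposition each centroid at the mean of its assigned sentences.
        new_centroids = np.array([
            tf[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```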

Thanks for any insight!


2 Answers


This is a bit long for a comment.

If you have a document-term matrix, find the principal components (of the covariance matrix) and compute the coefficients of the original documents in the principal-component space. You can then do k-means clustering in that reduced space.
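
Something like this is what I mean, as a minimal sketch with scikit-learn (the toy corpus, component count, and cluster count are only illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "stock prices fell sharply today",
        "markets rallied after the report"]

# Dense document-term matrix (PCA works on dense arrays).
X = CountVectorizer().fit_transform(docs).toarray()

# Principal components of the (mean-centered) data; 2 only because this corpus is tiny.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)   # coefficients of each document in PC space

# k-means in the reduced principal-component space.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
print(km.fit_predict(coords))
```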

With text data, you generally need a fair number of dimensions -- 20, 50, 100, or even more. I would also recommend Gaussian mixture models (expectation-maximization clustering) over k-means, but that is another story.
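
If you try the mixture-model route, the swap from k-means is small; a minimal sketch under the same caveats (`coords` here just stands in for documents already projected into principal-component space):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder for documents already projected onto, say, 5 principal components.
rng = np.random.default_rng(0)
coords = rng.normal(size=(40, 5))

# EM-fitted Gaussian mixture: soft memberships instead of hard k-means labels.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
probs = gmm.fit(coords).predict_proba(coords)   # shape (40, 3); rows sum to 1
hard_labels = probs.argmax(axis=1)              # hard assignment if you need one
```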


Resurrecting a slightly old question here, but it's worth linking the two...

Generally, you'd use some kind of locality-sensitive hashing instead of relying on raw word-occurrence frequencies. Either way, manually assembling the feature matrix is a huge hassle.

This SO answer walks through creating that feature matrix from a list of documents with scikit-learn, explaining each step. I think it'll show you the sequence of steps required.
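
The core of that pipeline looks roughly like this (a minimal sketch, not a reproduction of the linked answer; the sentences and parameters are only illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sentences = ["the cat sat on the mat",
             "the dog chased the cat",
             "stock prices fell sharply today",
             "markets rallied after the report"]

# Sparse TF-IDF feature matrix: one row per sentence, one column per term.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)

# KMeans accepts the sparse matrix directly, so you never have to assemble
# or densify the huge n-dimensional vectors by hand.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
print(km.fit_predict(X))
```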