
I'm trying to implement k-means for text clustering, specifically English sentences. So far I'm at the point where I have a term-frequency matrix for each document (sentence). I'm a little confused about the actual implementation of k-means on text data. Here's my guess of how it should work.

  1. Figure out the number of unique words in all sentences (a large number, call it n).

  2. Create k n-dimensional vectors (cluster centroids) and fill them with random numbers (how do I decide what the bounds for these numbers should be?)

  3. Determine the Euclidean distance from each of the q sentences to the k random centroids, reassign sentences, reposition the centroids, etc., as sketched below. (If n is very large, as it is for English, wouldn't calculating the Euclidean distance between these vectors be very costly?)
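
Here's a rough sketch of the loop I'm imagining (plain NumPy; `tf` is a placeholder for my q x n term-frequency matrix, and rather than picking arbitrary random numbers I initialize the centroids from randomly chosen sentences, since I don't know what bounds to use):

```python
import numpy as np

def kmeans_tf(tf, k, n_iters=100, seed=0):
    """Naive k-means on a q x n term-frequency matrix (my steps 2 and 3)."""
    rng = np.random.default_rng(seed)
    q, n = tf.shape
    # Step 2: initialize centroids from random sentences so they lie in the data's range.
    centroids = tf[rng.choice(q, size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Step 3: Euclidean distance from every sentence to every centroid.
        dists = np.linalg.norm(tf[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Reposition each centroid at the mean of its assigned sentences.
        new_centroids = np.array([
            tf[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```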

Thanks for any insight!


2 Answers


This is a bit long for a comment.

If you have a document-term matrix, find the principal components (of the covariance matrix) and compute the coefficients of the original documents in the principal-component space. You can then do k-means clustering in that reduced space.
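
Something like this is what I mean, as a minimal sketch with scikit-learn (the toy corpus, component count, and cluster count are only illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "stock prices fell sharply today",
        "markets rallied after the report"]

# Dense document-term matrix (PCA works on dense arrays).
X = CountVectorizer().fit_transform(docs).toarray()

# Principal components of the (mean-centered) data; 2 only because this corpus is tiny.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)   # coefficients of each document in PC space

# k-means in the reduced principal-component space.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
print(km.fit_predict(coords))
```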

With text data, you generally need a fair number of dimensions -- 20, 50, 100, or even more. I would also recommend Gaussian mixture models (expectation-maximization clustering) over k-means, but that is another story.
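
If you try the mixture-model route, the swap from k-means is small; a minimal sketch under the same caveats (`coords` here just stands in for documents already projected into principal-component space):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder for documents already projected onto, say, 5 principal components.
rng = np.random.default_rng(0)
coords = rng.normal(size=(40, 5))

# EM-fitted Gaussian mixture: soft memberships instead of hard k-means labels.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
probs = gmm.fit(coords).predict_proba(coords)   # shape (40, 3); rows sum to 1
hard_labels = probs.argmax(axis=1)              # hard assignment if you need one
```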


Resurrecting a slightly old question here, but it's worth linking the two...

Generally, you'd use some kind of locality-sensitive hashing instead of relying on raw word-occurrence frequencies. Either way, manually assembling the feature matrix is a huge hassle.

This SO answer walks through creating that feature matrix from a list of documents with scikit-learn, explaining each step. I think it'll show you the sequence of steps required.
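
The core of that pipeline looks roughly like this (a minimal sketch, not a reproduction of the linked answer; the sentences and parameters are only illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sentences = ["the cat sat on the mat",
             "the dog chased the cat",
             "stock prices fell sharply today",
             "markets rallied after the report"]

# Sparse TF-IDF feature matrix: one row per sentence, one column per term.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)

# KMeans accepts the sparse matrix directly, so you never have to assemble
# or densify the huge n-dimensional vectors by hand.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
print(km.fit_predict(X))
```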