I'm trying to implement k-means for text clustering, specifically English sentences. So far I'm at the point where I have a term frequency matrix for each document (sentence). I'm a little confused about the actual implementation of k-means on text data. Here's my guess at how it should work:
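For reference, this is roughly how I'm building the term frequency matrix. I'm using scikit-learn's `CountVectorizer` here, and the toy sentences are just for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: each "document" is a single sentence.
sentences = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

# Build a q x n term-frequency matrix (q sentences, n unique words).
vectorizer = CountVectorizer()
tf_matrix = vectorizer.fit_transform(sentences)  # sparse matrix, shape (q, n)

print(tf_matrix.shape)                     # (3, n)
print(vectorizer.get_feature_names_out())  # the n vocabulary terms
```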
1. Figure out the number of unique words across all sentences (a large number; call it `n`).
2. Create `k` vectors of dimension `n` (the clusters) and fill them with random numbers. (How do I decide what the bounds for these numbers should be?)
3. Compute the Euclidean distance from each of the `q` sentences to the `k` random clusters, reposition the clusters, and repeat; see the sketch after this list. (If `n` is very large, as it is for English vocabulary, wouldn't calculating the Euclidean distance for these vectors be very costly?)
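To make steps 2 and 3 concrete, here's a minimal sketch of the loop I'm imagining, in plain NumPy on a dense matrix. For initialization I've seen the suggestion to pick `k` actual sentence vectors as the starting centroids (the Forgy method) rather than random numbers, which would sidestep the bounds question entirely; the function name and defaults are just placeholders:

```python
import numpy as np

def kmeans(tf, k, n_iters=100, seed=0):
    """Plain k-means on a dense q x n term-frequency matrix."""
    rng = np.random.default_rng(seed)
    q, n = tf.shape

    # Initialize: pick k distinct sentences as the starting centroids
    # (the Forgy method), so no random-number bounds are needed.
    centroids = tf[rng.choice(q, size=k, replace=False)].astype(float)

    for _ in range(n_iters):
        # Squared Euclidean distance from every sentence to every centroid,
        # shape (q, k). This is the O(q * k * n) step I'm worried about,
        # and the (q, k, n) broadcast below is memory-hungry for large n.
        dists = ((tf[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)

        # Reposition each centroid at the mean of its assigned sentences;
        # keep a centroid in place if no sentence was assigned to it.
        new_centroids = np.array([
            tf[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids
```

From what I've read, real implementations avoid the dense broadcast by keeping the matrix sparse and expanding ||x - c||^2 = ||x||^2 - 2 x·c + ||c||^2, since TF vectors are mostly zeros; is that how the cost stays manageable in practice?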
Thanks for any insight!