Using k-means for document clustering, should clustering be on cosine similarity or on term vectors?

Question

Apologies if the answer to this is obvious, please be kind, this is my first time on here :-)

I would gratefully appreciate if someone could give me a steer on the appropriate input data structure for k-means. I am working on a masters dissertation in which I am proposing a new TF-IDF term weighing approach specific to my domain. I want to use k-means to cluster the results and then apply a number of internal and external evaluation criteria to see if my new term weighting method has any merit.

My steps so far (implemented in PHP), all working are

Step 1: Read in document collection Step 2: Clean document collection, feature extraction, feature selection Step 3: Term Frequency (TF) Step 4: Inverse Document Frequency (IDF) Step 5: TF * IDF Step 6: Normalise TF-IDF to fixed length vectors

Where I am struggling is

Step 7: Vector Space Model – Cosine Similarity

The only examples I can find, compare an input query to each document and find the similarity. Where there is no input query (this is not an information retrieval system) do I compare every single document in the corpus with every other document in the corpus (every pair of documents)? I cannot find any example of Cosine Similarity applied to a full document collection rather than a single example/query compared to the collection.

Step 8: K-Means

I am struggling here to understand if the input vector for k-means should contain a matrix of the cosine similarity score of every document in the collection against every other document (a matrix of cosine similarity). Or is k-means supposed to be applied over a term vector model. If it is the latter, every example I can find of k-means is quite basic and plots either singular terms. How do I handle the fact that there are multiple terms in my document collection etc.

Cosine Similarity and K-Means are implied as the solution to document clustering on so many examples so i am missing something very obvious.

If anyone could give me a steer I would be forever grateful.

Thanks

Claire

Many good questions generate some degree of opinion based on expert experience, but answers to this question will tend to be almost entirely based on opinions, rather than facts, references, or specific expertise. I would suggest that you find a development forum (perhaps redit?) to work out generalities. Then, when you have specific coding issues, come back to StackOverflow and we'll be glad to help. — Jay Blanchard

Has QUIT--Anony-Mousse Has QUIT--Anony-Mousse · Accepted Answer · 2015-05-11T13:06:47

K-means cannot operate on a similarity matrix.

Because k-means computes point-to-mean distances, not pairwise distances.

You need an implementation of spherical k-means if you want to use Cosine distance: at every iteration, the centers should be L2 normalized.

If I'm not mistaken, it should be equivalent to run k-means with cosine similarity, and only normalize the center to unit length at the end. But regular spherical k-means may be faster, because you can exploit data normalization to simplify cosine distance to the dot product.

You may want to reconsider using PHP. It is one of the worst possible choices for this type of programming task. It's good for interactive web page, but it doesn't shine at data analysis at all.

Using k-means for document clustering, should clustering be on cosine similarity or on term vectors?

5 Answers