Apologies if the answer to this is obvious; please be kind, this is my first time on here :-)
I would greatly appreciate it if someone could give me a steer on the appropriate input data structure for k-means. I am working on a master's dissertation in which I am proposing a new TF-IDF term weighting approach specific to my domain. I want to use k-means to cluster the results and then apply a number of internal and external evaluation criteria to see whether my new term weighting method has any merit.
My steps so far (implemented in PHP, all working) are:
Step 1: Read in document collection
Step 2: Clean document collection, feature extraction, feature selection
Step 3: Term Frequency (TF)
Step 4: Inverse Document Frequency (IDF)
Step 5: TF * IDF
Step 6: Normalise TF-IDF to fixed-length vectors
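To make the above concrete, here is a simplified sketch of Steps 5 and 6 as I have implemented them (the real code runs over the whole collection; `$tf` and `$idf` here are just illustrative arrays keyed by term, and the function names are my own):

```php
<?php
// Sketch of Steps 5 and 6 (simplified). $tf is an array of
// term => frequency for one document; $idf is term => IDF weight.
function tfidfVector(array $tf, array $idf): array
{
    // Step 5: TF * IDF
    $vector = [];
    foreach ($tf as $term => $freq) {
        $vector[$term] = $freq * ($idf[$term] ?? 0.0);
    }

    // Step 6: normalise to unit (L2) length
    $norm = sqrt(array_sum(array_map(fn($w) => $w * $w, $vector)));
    if ($norm > 0) {
        foreach ($vector as $term => $w) {
            $vector[$term] = $w / $norm;
        }
    }
    return $vector;
}
```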
Where I am struggling is:
Step 7: Vector Space Model – Cosine Similarity
The only examples I can find compare an input query to each document and compute the similarity. Since there is no input query here (this is not an information retrieval system), do I compare every single document in the corpus with every other document (every pair of documents)? I cannot find any example of cosine similarity applied across a full document collection rather than a single example/query compared against the collection.
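In case it helps to show what I mean by "every pair of documents", here is a sketch of the pairwise comparison I am considering. Because the vectors are already unit length from Step 6, the cosine reduces to a dot product over the shared terms:

```php
<?php
// Cosine similarity of two term => weight vectors. Since both
// vectors are already normalised to unit length (Step 6), the
// cosine is simply the dot product over the terms they share.
function cosine(array $a, array $b): float
{
    $dot = 0.0;
    foreach ($a as $term => $w) {
        if (isset($b[$term])) {
            $dot += $w * $b[$term];
        }
    }
    return $dot;
}

// Pairwise similarity matrix over the whole collection.
// $docs is an array of document vectors (term => weight).
function similarityMatrix(array $docs): array
{
    $n = count($docs);
    $sim = [];
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i; $j < $n; $j++) {
            $s = cosine($docs[$i], $docs[$j]);
            $sim[$i][$j] = $s;
            $sim[$j][$i] = $s; // the matrix is symmetric
        }
    }
    return $sim;
}
```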
Step 8: K-Means
I am struggling here to understand whether the input for k-means should be a matrix of the cosine similarity of every document against every other document (a document-by-document similarity matrix), or whether k-means is supposed to be applied directly to the term vector model (a document-by-term matrix). If it is the latter, every example of k-means I can find is quite basic and plots points in only one or two dimensions. How do I handle the fact that there are many terms (i.e. many dimensions) in my document collection?
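To make the second interpretation concrete, here is a rough sketch of what applying k-means directly to the TF-IDF document vectors might look like, using cosine distance (1 minus cosine similarity) and re-normalising the centroids. I am not at all sure this is the right input structure, which is really the heart of my question:

```php
<?php
// Rough k-means sketch over TF-IDF document vectors (document-by-term).
// $docs: array of term => weight vectors, already unit length (Step 6).
// Distance is cosine distance = 1 - dot product.
function kMeans(array $docs, int $k, int $maxIter = 50): array
{
    // Naive initialisation: take the first k documents as centroids
    // (a random choice would be better in practice).
    $centroids = array_slice($docs, 0, $k);
    $assign = [];

    for ($iter = 0; $iter < $maxIter; $iter++) {
        // Assignment step: each document goes to its nearest centroid.
        $changed = false;
        foreach ($docs as $i => $doc) {
            $best = 0;
            $bestDist = PHP_FLOAT_MAX;
            foreach ($centroids as $c => $centroid) {
                $dot = 0.0;
                foreach ($doc as $term => $w) {
                    $dot += $w * ($centroid[$term] ?? 0.0);
                }
                $dist = 1.0 - $dot;
                if ($dist < $bestDist) {
                    $bestDist = $dist;
                    $best = $c;
                }
            }
            if (!isset($assign[$i]) || $assign[$i] !== $best) {
                $changed = true;
            }
            $assign[$i] = $best;
        }
        if (!$changed) {
            break; // converged: no document changed cluster
        }

        // Update step: sum the vectors in each cluster, then
        // re-normalise to unit length (same direction as the mean).
        $centroids = array_fill(0, $k, []);
        foreach ($docs as $i => $doc) {
            $c = $assign[$i];
            foreach ($doc as $term => $w) {
                $centroids[$c][$term] = ($centroids[$c][$term] ?? 0.0) + $w;
            }
        }
        foreach ($centroids as $c => $centroid) {
            $norm = sqrt(array_sum(array_map(fn($w) => $w * $w, $centroid)));
            if ($norm > 0) {
                foreach ($centroid as $term => $w) {
                    $centroids[$c][$term] = $w / $norm;
                }
            }
        }
    }
    return $assign; // document index => cluster index
}
```

(The naive initialisation and the re-normalised centroids are just placeholder choices on my part; if the term-vector interpretation is the correct one, I would tidy those up.)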
Cosine similarity and k-means are presented as the solution to document clustering in so many examples that I must be missing something very obvious.
If anyone could give me a steer, I would be forever grateful.
Thanks
Claire