5
votes

Apologies if the answer to this is obvious, please be kind, this is my first time on here :-)

I would greatly appreciate it if someone could give me a steer on the appropriate input data structure for k-means. I am working on a masters dissertation in which I am proposing a new TF-IDF term weighting approach specific to my domain. I want to use k-means to cluster the results and then apply a number of internal and external evaluation criteria to see if my new term weighting method has any merit.

My steps so far (implemented in PHP), all working, are:

Step 1: Read in document collection
Step 2: Clean document collection, feature extraction, feature selection
Step 3: Term Frequency (TF)
Step 4: Inverse Document Frequency (IDF)
Step 5: TF * IDF
Step 6: Normalise TF-IDF to fixed-length vectors
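As a minimal illustration of Steps 3–6 (sketched in Python rather than my PHP implementation, for brevity; the toy corpus and whitespace tokenisation are invented for the example):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are animals",
]

# Step 2 (minimal stand-in): lowercase and split on whitespace
tokenised = [d.lower().split() for d in docs]
vocab = sorted(set(t for doc in tokenised for t in doc))

# Step 3: term frequency per document
tf = [Counter(doc) for doc in tokenised]

# Step 4: inverse document frequency over the whole collection
n_docs = len(docs)
df = {t: sum(1 for doc in tokenised if t in doc) for t in vocab}
idf = {t: math.log(n_docs / df[t]) for t in vocab}

# Steps 5-6: TF * IDF, then L2-normalise each vector to unit length
vectors = []
for counts in tf:
    vec = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    vectors.append([x / norm for x in vec])
```

Each row of `vectors` is one document's fixed-length, unit-length TF-IDF vector over the shared vocabulary.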

Where I am struggling is

Step 7: Vector Space Model – Cosine Similarity

The only examples I can find compare an input query to each document and find the similarity. Where there is no input query (this is not an information retrieval system), do I compare every single document in the corpus with every other document in the corpus (every pair of documents)? I cannot find any example of cosine similarity applied to a full document collection rather than a single example/query compared against the collection.
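To make the question concrete, here is a sketch of what I think the "every pair" version would look like (Python for brevity; the toy unit vectors are made up, not real TF-IDF output). Since the vectors are already unit length after Step 6, each cosine similarity reduces to a dot product:

```python
import math

def cosine_matrix(vectors):
    """Pairwise cosine similarity for a whole collection of
    L2-normalised vectors: entry [i][j] compares document i to j."""
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # unit-length inputs, so the dot product IS the cosine
            sim[i][j] = sum(a * b for a, b in zip(vectors[i], vectors[j]))
    return sim

# toy unit vectors (hypothetical, for illustration only)
v = [[1.0, 0.0],
     [0.0, 1.0],
     [1 / math.sqrt(2), 1 / math.sqrt(2)]]
sim = cosine_matrix(v)
# sim[0][0] == 1.0: every document is maximally similar to itself
```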

Step 8: K-Means

I am struggling here to understand whether the input for k-means should be a matrix of the cosine similarity score of every document in the collection against every other document (a cosine similarity matrix), or whether k-means is supposed to be applied over a term vector model. If it is the latter, every example of k-means I can find is quite basic and plots only single terms. How do I handle the fact that there are multiple terms in my document collection?
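To show the second option I am weighing, here is a small pure-Python k-means run over a document-by-term matrix (a sketch only, not my actual code; the toy matrix values are invented):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: each point is one document's term vector
    (one row per document, one column per term)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # assignment step: nearest centre by squared Euclidean distance
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centers[c])))
                  for p in points]
        # update step: each centre becomes the mean of its documents
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [sum(col) / len(members)
                              for col in zip(*members)]
    return labels

# toy document-by-term matrix: 4 documents x 3 terms (made-up values)
docs = [[1.0, 0.0, 0.0],
        [0.9, 0.1, 0.0],
        [0.0, 0.0, 1.0],
        [0.0, 0.1, 0.9]]
labels = kmeans(docs, 2)
```

Here the first two documents end up in one cluster and the last two in the other; the number of terms just sets the dimensionality of each point.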

Cosine similarity and k-means are presented as the solution to document clustering in so many examples, so I must be missing something very obvious.

If anyone could give me a steer I would be forever grateful.

Thanks

Claire

5
Many good questions generate some degree of opinion based on expert experience, but answers to this question will tend to be almost entirely based on opinions, rather than facts, references, or specific expertise. I would suggest that you find a development forum (perhaps Reddit?) to work out generalities. Then, when you have specific coding issues, come back to StackOverflow and we'll be glad to help. – Jay Blanchard

5 Answers

0
votes

K-means cannot operate on a similarity matrix.

Because k-means computes point-to-mean distances, not pairwise distances.

You need an implementation of spherical k-means if you want to use Cosine distance: at every iteration, the centers should be L2 normalized.

If I'm not mistaken, it should be equivalent to run k-means with cosine similarity, and only normalize the center to unit length at the end. But regular spherical k-means may be faster, because you can exploit data normalization to simplify cosine distance to the dot product.
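As a sketch, the centre update that makes k-means "spherical" looks like this (pure Python, illustrative only):

```python
import math

def normalise(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def spherical_center(members):
    """Mean of the assigned (unit-length) document vectors,
    re-projected onto the unit sphere -- the extra L2-normalisation
    step that spherical k-means adds at every iteration."""
    mean = [sum(col) / len(members) for col in zip(*members)]
    return normalise(mean)

c = spherical_center([[1.0, 0.0], [0.0, 1.0]])
# c is the unit vector halfway between the two: roughly [0.7071, 0.7071]
```

With both the data and the centres on the unit sphere, maximising the dot product is the same as maximising cosine similarity.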

You may want to reconsider using PHP. It is one of the worst possible choices for this type of programming task. It's good for interactive web pages, but it doesn't shine at data analysis at all.

0
votes

I second Anony-Mousse's opinion that you should reconsider PHP, and would like to suggest Python, as there are several useful libraries for this kind of problem:

Numpy: a great and efficient package for scientific computing.

SciPy: Actually has several routines for k-means clustering: see here

Theano: For more machine learning needs, especially deep learning.

Also, there is this great tutorial about the k-means algorithm. It also supplies pseudocode in Python. You can use this, and maybe an implementation of your own, to understand the algorithm better, but ultimately I would use the libraries mentioned above, as they are optimised for performance, which is definitely something to keep in mind if you have a big collection of documents.
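For example, with SciPy (a minimal sketch; the toy TF-IDF-like matrix is invented, and `kmeans2`'s `'++'` initialisation and `seed` keyword assume a reasonably recent SciPy):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# toy document-by-term matrix: 4 documents x 3 terms (made-up values)
X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.0, 0.2, 0.8],
])

# k-means with k=2, kmeans++-style initialisation
centroids, labels = kmeans2(X, 2, minit='++', seed=0)
```

`labels` assigns each document (row) to a cluster, and `centroids` holds the cluster centres in term space.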

0
votes

If it helps anybody else out: I have found that it is possible to apply k-means clustering to a multi-dimensional term vector, but if more than three dimensions are included (which will be the case for any document collection), you cannot visualise it. I believe this is what threw me here. All of the examples I saw of k-means included a graph visualisation, which led me to believe, incorrectly, that the source data for k-means was expected to be two-dimensional, such as 0 and the cosine similarity. Thank you kindly to the respondents for your help, much appreciated.

0
votes

Use TF-IDF to calculate the cosine similarity. Use cosine similarity scores as the input data for your clustering algorithm.