1
votes

After eliminating stop words and applying stemming to a set of documents, I applied bisecting K-means in JavaScript in order to cluster documents retrieved from some web pages and find similarities between them.

What would be a good method for deciding how many clusters to create when the clusters are text-based? I have seen methods such as the Elbow method, Silhouette, and information-criterion approaches, but assuming I have no prior information about the clusters I create, these methods seem better suited to numeric clustering than to text-based clusters.

Could entropy be a good measure for finding the right number of clusters after applying bisecting k-means to text? Or the F-measure? I mean, stopping the splitting once a certain value is reached. Would these work well for large data sets?

On text, none of them seem to work reliably. – Has QUIT--Anony-Mousse
Then how can I determine the number k for text clustering? Any ideas? – user3026017
Do you have many small documents or a few long documents? Do multiple occurrences of the same word indicate greater similarity, or is it just the occurrence of unique words that matters? – knb
I have many small documents. I think similarity should be given by multiple occurrences of the same word. Stop words (such as "the", "a" and similar) are eliminated beforehand. Rarely used words should be more relevant, but how can I identify them? – user3026017
On small documents such as tweets it does not work at all. – Has QUIT--Anony-Mousse

1 Answer

0
votes

Short answer:

You can use Term Frequency-Inverse Document Frequency (Tf-Idf). It emphasizes rare words that are used in only a few documents and penalizes words that occur in all documents. If you apply PCA to the Tf-Idf matrix of your dataset, you can use the "scree plot" (roughly the elbow method) to find a suitable number of clusters.
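For illustration, here is a minimal hand-rolled Tf-Idf sketch in JavaScript (the function name, tokenization, and toy data are my own assumptions; it presumes the documents are already stop-word filtered and stemmed, as in the question):

```javascript
// Minimal Tf-Idf sketch. Assumes the documents are already stop-word
// filtered and stemmed (as in the question) and arrive as token arrays.
function tfIdfVectors(docs) {
  const df = new Map();                 // document frequency per term
  for (const doc of docs) {
    for (const term of new Set(doc)) {
      df.set(term, (df.get(term) || 0) + 1);
    }
  }
  const n = docs.length;
  return docs.map(doc => {
    const tf = new Map();               // raw term counts in this document
    for (const term of doc) tf.set(term, (tf.get(term) || 0) + 1);
    const vec = {};
    for (const [term, count] of tf) {
      // high when a term is frequent here but rare across the collection;
      // zero when the term occurs in every document
      vec[term] = (count / doc.length) * Math.log(n / df.get(term));
    }
    return vec;
  });
}

// Toy usage: "zebra" gets the highest weight in the third document
// because it appears in only one document of the collection.
const vectors = tfIdfVectors([
  ["cat", "dog", "cat"],
  ["dog", "bird"],
  ["cat", "zebra", "zebra"],
]);
console.log(vectors[2]);
```

One option is then to run your bisecting k-means on these sparse vectors with cosine similarity and inspect the within-cluster scatter for each k, in the spirit of the elbow/scree idea above.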

Long example:

The following example does NOT use k-means; it works with a few long documents, and the number of "clusters" is decided to be two (using principal components and Tf-Idf, actually), but it uses real data in an instructive way:

In the PhD dissertation documenting the text-mining package tm developed for the R software, the author of tm, Ingo Feinerer, gives an example (Chapter 10) of how to do stylometry, that is, to cluster/identify the 5 books of the "Wizard of Oz" series. For one of these books the authorship is disputed (there are two authors in the series, Thompson and Baum, but their respective contributions to that book are unknown).

Feinerer chops the documents into 500-line chunks to build a TermDocumentMatrix, then performs variants of Principal Component Analysis (PCA), one of them with Tf-Idf weighting, on that matrix, and shows by visual inspection of the PCA plots that the disputed book appears to have been written mostly by Thompson, although parts might have been written by Baum.
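A rough JavaScript analogue of the chunking and TermDocumentMatrix step could look like the sketch below (my own approximation with a naive tokenizer, not Feinerer's R code; the PCA itself would need a numeric library and is omitted):

```javascript
// Sketch of the chunk-then-count step (my approximation of the tm
// workflow, not Feinerer's actual R code); the PCA itself would need
// a numeric library and is omitted here.
function chunkLines(text, linesPerChunk = 500) {
  const lines = text.split("\n");
  const chunks = [];
  for (let i = 0; i < lines.length; i += linesPerChunk) {
    chunks.push(lines.slice(i, i + linesPerChunk).join(" "));
  }
  return chunks;
}

function termDocumentMatrix(chunks) {
  const vocab = new Map();                      // term -> row index
  const counts = chunks.map(() => new Map());   // per-chunk term counts
  chunks.forEach((chunk, col) => {
    for (const term of chunk.toLowerCase().split(/\W+/).filter(Boolean)) {
      if (!vocab.has(term)) vocab.set(term, vocab.size);
      counts[col].set(term, (counts[col].get(term) || 0) + 1);
    }
  });
  // rows = terms, columns = chunks, mirroring tm's TermDocumentMatrix
  const matrix = Array.from(vocab.keys(), term =>
    chunks.map((_, col) => counts[col].get(term) || 0)
  );
  return { terms: [...vocab.keys()], matrix };
}
```

Tf-Idf weights like those in the first sketch can then be applied to the columns of this matrix before running PCA and reading off the scree plot.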

In the plot this is indicated by the points inside the pinkish wiggly oval (drawn by me). The green dots are chunks from a book with known authorship (Thompson), and the yellow dots are from the book with unknown/disputed authorship. The points fall close together in the plot; that is the evidence here. It is qualitative, but it is just one of many examples in the PDF. [PCA plot from the dissertation, with the disputed chunks circled]

The Tf-Idf PCA plot on page 95 looks similar.

I have not included any R code because I don't know whether you use R, this post is already getting long, and you can look it up yourself in the PDF.

(And I don't know of any ready-made Tf-Idf implementations in JavaScript, hence the hand-rolled sketch above.)