Short answer:
You can use Term Frequency-Inverse Document Frequency (Tf-Idf). It emphasises words that are specific to a single document, and it penalizes words that appear in all of the documents.
If you apply PCA to the Tf-Idf matrix of your dataset, you can use the "scree plot" (much like the elbow method) to find a suitable number of clusters.
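As a rough illustration (not taken from the dissertation discussed below), here is a minimal sketch of that idea in Python with scikit-learn; I don't have a JavaScript version, and the tiny `docs` list is just a placeholder for your own corpus:

```python
# Minimal sketch: Tf-Idf + PCA scree plot (placeholder corpus).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

docs = [
    "the scarecrow wanted a brain",
    "the tin woodman wanted a heart",
    "the cowardly lion wanted courage",
    # ... your own documents ...
]

# Tf-Idf: words specific to one document get high weight,
# words that occur in every document get low weight.
X = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

# PCA on the Tf-Idf matrix; the scree plot shows how much variance
# each component explains -- look for the "elbow".
pca = PCA().fit(X)
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.title("Scree plot")
plt.show()
```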
Long example:
The following is an example that does NOT use k-means: it works with a few long documents, settles on two "clusters" (using principal components and Tf-Idf, actually), and uses real data in a creative way:
In the PhD dissertation documenting the "textmining" package `tm`, developed for the R software, the package's author, Ingo Feinerer, gives an example (Chapter 10) of how to do stylometry, that is, how to cluster/identify the 5 books of the "Wizard of Oz" series. For one of these books the authorship is disputed: there are two authors in the series, Thompson and Baum, but their respective contributions to that book are unknown.
Feinerer chops the documents into 500-line chunks to build a TermDocumentMatrix, then performs several variants of Principal Component Analysis (PCA) on that matrix, one of them with Tf-Idf weighting, and shows by visual inspection of the PCA plots that the disputed book was most likely authored by Thompson, although parts might have been written by Baum.
In the plot this is indicated by the points inside the pinkish wiggly oval (drawn by me). The green dots are chunks from a book with known authorship (Thompson), and the yellow dots are chunks from the book of unknown/disputed authorship. The fact that these points fall close together in the plot is the evidence here; it is qualitative, but this is just one example of many in the PDF. The Tf-Idf PCA plot on page 95 looks similar.
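To make that workflow more concrete, here is a rough Python analogue (this is not Feinerer's R code; the file names, the plain-text chunking, and the colours are my own assumptions for illustration):

```python
# Rough analogue of the stylometry workflow: 500-line chunks -> Tf-Idf -> PCA.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

CHUNK_LINES = 500  # chunk size used in the dissertation

def chunk_book(path, label):
    """Split a plain-text book into 500-line chunks, tagged with a label."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    return [(" ".join(lines[i:i + CHUNK_LINES]), label)
            for i in range(0, len(lines), CHUNK_LINES)]

# Hypothetical file names: a book of known authorship and the disputed one.
chunks = (chunk_book("thompson_known.txt", "Thompson (known)")
          + chunk_book("disputed.txt", "disputed"))
texts, labels = zip(*chunks)
labels = np.array(labels)

# Tf-Idf-weighted term-document matrix of the chunks.
X = TfidfVectorizer(stop_words="english").fit_transform(texts).toarray()

# Project the chunks onto the first two principal components.
coords = PCA(n_components=2).fit_transform(X)

# Visual inspection: do the disputed chunks fall near the known-author chunks?
for label, colour in [("Thompson (known)", "green"), ("disputed", "gold")]:
    mask = labels == label
    plt.scatter(coords[mask, 0], coords[mask, 1], c=colour, label=label)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()
```

If the gold points land inside the cloud of green points, that is the same kind of qualitative evidence Feinerer reads off his plots.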
I have not given any R code because I don't know whether you use R, this post is already getting long, and you can look it all up in the PDF.
(I also don't know of any Tf-Idf implementations in JavaScript.)