Better text documents clustering than tf/idf and cosine similarity?

Question

I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster of tweets that talk about the same topic. I tried clustering the stream using an online clustering algorithm with tf/idf and cosine similarity, but I found that the results are quite bad.

The main disadvantage of using tf/idf is that it clusters documents that share keywords, so it's only good for identifying near-identical documents. For example, consider the following sentences:

1- The website Stackoverflow is a nice place.
2- Stackoverflow is a website.

The previous two sentences will likely be clustered together with a reasonable threshold value, since they share a lot of keywords. But now consider the following two sentences:

1- The website Stackoverflow is a nice place.
2- I visit Stackoverflow regularly.

Now, using tf/idf, the clustering algorithm will fail miserably because they share only one keyword, even though they both talk about the same topic.
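The effect can be reproduced with a small pure-Python sketch. The stop-word list, the smoothed IDF formula, and the helper names (`tokenize`, `tfidf_vectors`, `cosine`) are assumptions made for illustration, not the setup used on the real stream.

```python
import math
import re
from collections import Counter

# Tiny hand-picked stop-word list, for illustration only.
STOPWORDS = {"the", "is", "a", "i"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant), so a term that occurs in every
    # document still gets a non-zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "The website Stackoverflow is a nice place.",
    "Stackoverflow is a website.",
    "I visit Stackoverflow regularly.",
]
a, b, c = tfidf_vectors(docs)
sim_ab = cosine(a, b)  # the pair that shares two keywords
sim_ac = cosine(a, c)  # the pair that shares only "stackoverflow"
print(f"sentence 1 vs 2: {sim_ab:.3f}")
print(f"sentence 1 vs 3: {sim_ac:.3f}")
```

On this toy corpus the first pair scores several times higher than the second, so any threshold that separates unrelated tweets would also split the second pair.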

My question: are there better techniques to cluster documents?

@ThomasJungblut well, TF-IDF is supposed to be a weighting scheme that puts more weight on relevant keywords already. I figure the problem is that tweets are such tiny text fragments that you can't expect similarity to work very well on them beyond "near identity". Most tweets aren't even complete sentences, so NLP will likely also fail. — Has QUIT--Anony-Mousse
One thing to watch with LSI / LDA / NMF etc. is topic drift. Training a model on a known dataset can yield good-looking results even if your pipeline isn't done correctly. If you then apply your model to a totally unseen dataset, you may see a significant drop in performance due to overfitting to the original training data. Because Twitter text is so short, the representation will need a bit of fiddling with, as there may not be enough text to train a model properly. — Steve

Fred Foo · Accepted Answer · 2013-07-09T08:17:52

In my experience, cosine similarity on latent semantic analysis (LSA/LSI) vectors works a lot better than raw tf-idf for text clustering, though I admit I haven't tried it on Twitter data. In particular, it tends to take care of the sparsity problem that you're encountering, where the documents just don't contain enough common terms.
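As a rough sketch of what that could look like with scikit-learn: tf-idf vectors are reduced with truncated SVD (which is what LSA amounts to), re-normalized so that dot products equal cosine similarities, and then clustered. The toy corpus, `n_components=2`, and the use of `KMeans` are assumptions for illustration; a real tweet stream would need far more documents, more components, and an online clusterer.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Made-up toy corpus; far too small for meaningful latent dimensions.
docs = [
    "The website Stackoverflow is a nice place.",
    "Stackoverflow is a website.",
    "I visit Stackoverflow regularly.",
    "I regularly visit this website to ask programming questions.",
    "The weather is nice today.",
    "It will rain today, so the weather is bad.",
]

# tf-idf -> truncated SVD (LSA) -> unit-length rows.
lsa = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),
)
doc_vectors = lsa.fit_transform(docs)

# Cosine similarity is now computed in the dense latent space,
# not on the sparse keyword vectors.
similarities = cosine_similarity(doc_vectors)

# Any clusterer that works on dense vectors can be applied from here.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(doc_vectors)
print(similarities.round(2))
print(labels)
```

Because the latent dimensions are learned from term co-occurrence, two documents can end up close together without sharing many exact keywords, provided the corpus is large enough to show that their terms tend to appear in the same contexts.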

Topic models such as LDA might work even better.
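A comparable sketch for LDA, again with scikit-learn and again on a made-up corpus. LDA is fit on raw term counts rather than tf-idf weights, and each document comes out as a distribution over topics; assigning a document to its most probable topic is one simple (assumed) way to turn that into a clustering.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus; real topic models need far more text than this.
docs = [
    "The website Stackoverflow is a nice place.",
    "Stackoverflow is a website.",
    "I visit Stackoverflow regularly.",
    "I regularly visit this website to ask programming questions.",
    "The weather is nice today.",
    "It will rain today, so the weather is bad.",
]

# LDA models word counts, so use CountVectorizer instead of tf-idf.
counts = CountVectorizer(stop_words="english").fit_transform(docs)

# n_components is the number of topics; 2 is an arbitrary choice here.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one row per document, rows sum to 1

# Hard assignment: use the most probable topic as the cluster label.
labels = doc_topics.argmax(axis=1)
print(doc_topics.round(2))
print(labels)
```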
