I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster of tweets that talk about the same topic. I tried clustering the stream using an online clustering algorithm with tf/idf and cosine similarity, but I found that the results are quite bad.
The main disadvantage of tf/idf is that it clusters documents that share keywords, so it is only good at identifying near-identical documents. For example, consider the following sentences:
1- The website Stackoverflow is a nice place. 2- Stackoverflow is a website.
The previous two sentences will likely be clustered together with a reasonable threshold value, since they share a lot of keywords. But now consider the following two sentences:
1- The website Stackoverflow is a nice place. 2- I visit Stackoverflow regularly.
With tf/idf the clustering algorithm will fail miserably here, because the two sentences share only one keyword even though they both talk about the same topic.
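To make the problem concrete, here is a minimal sketch of the comparison in plain Python. The stopword list, the smoothed idf formula, and the function names are assumptions chosen for illustration; they are not necessarily the exact setup used on the stream.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"the", "is", "a", "i"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one sparse tf/idf vector (a dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf (an assumption): log((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "The website Stackoverflow is a nice place.",
    "Stackoverflow is a website.",
    "I visit Stackoverflow regularly.",
]
vecs = tfidf_vectors(docs)

# Pair that shares most of its keywords.
sim_near_duplicate = cosine(vecs[0], vecs[1])
# Pair that is about the same topic but shares a single keyword.
sim_same_topic = cosine(vecs[0], vecs[2])

print(f"near-duplicate pair: {sim_near_duplicate:.3f}")
print(f"same-topic pair:     {sim_same_topic:.3f}")
```

The first pair scores several times higher than the second, so any threshold loose enough to merge the second pair would also merge many unrelated tweets that happen to share one word.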
My question: are there better techniques for clustering documents?