First of all, I should mention that by document clustering I mean the data mining technique, not workload clustering or anything like that.
Here is what I have:
- I receive documents continuously. Let's assume they are news articles (the real data is rather similar).
- Every time I get a new batch of "news", I need to add the documents to the Solr index, get cluster information for each document, and store that information in the DB (so I know each document's cluster).
- I can't wait for a cluster-definition service/program to run from time to time; clusters should be defined on the fly.
- I want to be able to get clusters for a specific time period only (for example, search for clusters only among documents that were loaded one month ago).
- I will have tens of thousands of new documents every day and an overall base of several million.
A long time ago I used a library (I can't remember its name) that received a document as input and returned a cluster ID; if it decided the document belonged to a new cluster, it created one, and so on. But it was slow.
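To make the behaviour I'm describing concrete, here is a minimal sketch of that kind of single-pass ("leader") clustering in plain Python. It is not the library I used and not anything from Solr or Mahout; the names `OnlineClusterer`, `vectorize`, `cosine` and the `threshold` value are all made up for illustration, and the bag-of-words similarity is just an assumption.

```python
import math
from collections import Counter


def vectorize(text):
    """Very naive bag-of-words vector: lowercase, split on whitespace."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class OnlineClusterer:
    """Hypothetical single-pass clusterer: one document in, one cluster ID out."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed value; would need tuning on real data
        self.centroids = []         # summed term counts, one Counter per cluster

    def add(self, text):
        vec = vectorize(text)
        best_id, best_sim = None, 0.0
        # Linear scan over every existing cluster.
        for cluster_id, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_id, best_sim = cluster_id, sim
        if best_id is not None and best_sim >= self.threshold:
            # Close enough: fold the document into the existing cluster.
            self.centroids[best_id].update(vec)
            return best_id
        # Nothing similar enough: start a new cluster.
        self.centroids.append(vec)
        return len(self.centroids) - 1


clusterer = OnlineClusterer(threshold=0.3)
ids = [
    clusterer.add("stock market falls sharply today"),
    clusterer.add("stock market falls again today"),
    clusterer.add("local football team wins cup final"),
]
```

Each call compares the new document against every existing cluster, so the cost per document grows with the number of clusters. That may be why the library I used was slow, though I can't say for sure.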
I've found a book about Mahout, but I still can't figure out which parts I should read and which of them cover what I want. It may also be impossible to do this with Solr/Mahout without writing my own Solr plugins.
I would appreciate any thoughts or advice on how to build such a system.
Thanks in advance.