
First of all, I should mention that by document clustering I mean the data mining technique, not workload clustering or anything like that.

To begin with, here is what I have:

  • I get documents all the time. Let's assume they are news items (it's a rather similar thing).
  • Every time I get a new batch of "news", I need to add the documents to the Solr index and get cluster information for each document. I then store this information in the DB (so I know each document's cluster).
  • I can't wait for a cluster-definition service/program to run from time to time; clusters should be defined on the fly.
  • I want to be able to get clusters only for some period of time (for example, only for documents that were loaded a month ago).
  • I will have tens of thousands of new documents every day and an overall base of several million.

A long time ago I used a library (I can't remember its name) that received a document as input and returned a cluster ID; if it decided the document belonged to a new cluster, it created one, and so on. But it was slow.
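To make that behaviour concrete, here is a minimal sketch of single-pass ("leader") clustering in plain Python. It is not the library mentioned above and not Mahout; the class name `IncrementalClusterer`, the bag-of-words vectors, and the `threshold` value are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a sparse term-frequency vector (naive tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalClusterer:
    """Hypothetical single-pass clusterer: one document in, one cluster ID out."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed value; needs tuning on real data
        self.centroids = []         # one summed term vector per cluster

    def add(self, text):
        vec = vectorize(text)
        best_id, best_sim = None, 0.0
        # Linear scan over all existing clusters to find the closest one.
        for cid, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id is not None and best_sim >= self.threshold:
            # Close enough: fold the document into the existing cluster.
            self.centroids[best_id].update(vec)
            return best_id
        # Otherwise start a new cluster seeded with this document.
        self.centroids.append(vec)
        return len(self.centroids) - 1


clusterer = IncrementalClusterer(threshold=0.3)
first = clusterer.add("solr index search cluster")
second = clusterer.add("solr search index clusters documents")
third = clusterer.add("football match goal score")
```

Each call compares the new document against every existing centroid, so the cost per document grows with the number of clusters; that may be one reason such an approach feels slow at a scale of millions of documents.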

I've found a book about Mahout, but I still can't figure out what I should read or which parts cover what I want. And maybe it's impossible to do this with Solr/Mahout without writing my own plugins for Solr.

I would appreciate any thoughts or advice on how to build such a system.

Thanks in advance.

2 Answers

0
votes

I don't think you need any kind of custom Solr plugin. The classification for new documents can be determined during the normal indexing process of your "news", so you can simply add it as a regular field on every Solr document.
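As a rough sketch of that idea, the snippet below builds (but does not send) a JSON update request that carries the cluster as an ordinary field, plus a query that counts documents per cluster within a time window. The core URL, the field names `cluster_id` and `loaded_at`, and the sample document are hypothetical; they would have to match your own schema.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical core location; adjust to your own Solr installation.
SOLR_CORE_URL = "http://localhost:8983/solr/news"


def build_update_request(docs):
    """Build a POST to Solr's JSON update handler for a list of documents."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        SOLR_CORE_URL + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_cluster_facet_url(window_start="NOW-1MONTH"):
    """Build a select URL that facets on the cluster field for recent documents."""
    params = {
        "q": "*:*",
        "fq": f"loaded_at:[{window_start} TO NOW]",  # Solr date-math range filter
        "rows": 0,                                   # only facet counts are needed
        "facet": "true",
        "facet.field": "cluster_id",
    }
    return SOLR_CORE_URL + "/select?" + urlencode(params)


# Illustrative document; cluster_id would come from whatever clusterer you run.
doc = {
    "id": "news-1",
    "title": "Example headline",
    "cluster_id": 42,
    "loaded_at": "2024-01-01T00:00:00Z",
}
update_request = build_update_request([doc])
facet_url = build_cluster_facet_url()
```

The request could then be sent with `urllib.request.urlopen(update_request)` against a running Solr instance. Since the cluster is just a field, restricting clusters to a period of time becomes a normal filter query.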

When it comes to clustering and classification with Mahout, I'd say the Mahout in Action book is a good resource to start with.

Cheers.

0
votes

This is a rather old post, but let me respond anyway: you can use Carrot2 (http://project.carrot2.org/index.html) for Solr result clustering. It always works on the fly.
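For illustration, a query against Solr's clustering component might be built like this. It is only a sketch: it assumes a request handler registered at `/clustering` with the clustering search component enabled in `solrconfig.xml`, and the core URL and example query are hypothetical.

```python
from urllib.parse import urlencode


def build_clustering_url(core_url, query, rows=100):
    """Build a URL that asks Solr to cluster the top `rows` search results."""
    params = {
        "q": query,
        "rows": rows,          # only the returned rows are clustered
        "clustering": "true",  # switches the clustering component on
        "wt": "json",
    }
    # Assumption: the handler path is /clustering; it depends on your config.
    return core_url + "/clustering?" + urlencode(params)


url = build_clustering_url("http://localhost:8983/solr/news", "title:election")
```

Note that this clusters search results at query time, so it would not by itself give you a stored cluster ID for every document in the index.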