
I would like to implement a clustering algorithm on top of Lucene. For that, I need the tf-idf term vector that represents each document, so that I can represent the centroids the same way the documents are represented, compute the similarity between documents and clusters, and update the centroids by recalculating their feature values. But how can I do that on top of Lucene?

Can I even get tf-idf values out of Lucene?

I know that the term frequency of each term in each document is stored, but does that mean I would need to calculate the idf 'manually' for each term? And how would I then build the vectors to use for clustering?
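
To make it more concrete, the kind of representation and comparison I have in mind is roughly this (just a sketch, e.g. cosine similarity, and all the names are mine):

// Each document and each centroid as a sparse vector: term -> tf-idf weight.
Map<String, Double> docVector = new HashMap<String, Double>();
Map<String, Double> centroid = new HashMap<String, Double>();

// Cosine similarity between a document vector and a centroid.
double dot = 0, docNorm = 0, centroidNorm = 0;
for (Map.Entry<String, Double> e : docVector.entrySet()) {
    Double w = centroid.get(e.getKey());
    if (w != null) {
        dot += e.getValue() * w;
    }
    docNorm += e.getValue() * e.getValue();
}
for (double w : centroid.values()) {
    centroidNorm += w * w;
}
double similarity = dot / (Math.sqrt(docNorm) * Math.sqrt(centroidNorm));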

Thanks


2 Answers

0 votes

Note that Lucene uses a variation of the TF-IDF formula you would find in textbooks.

You can see details here:

http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
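
For reference, the practical scoring function described on that page is, roughly:

score(q,d) = coord(q,d) * queryNorm(q) * sum over t in q of ( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) )

so tf and idf both appear, but they are combined with query and field normalization factors rather than used as a plain tf*idf weight.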

In particular, only the terms used in the query are accessed. This is done mainly for performance: Lucene reads as little data from the index as possible.

If you want access to the exact similarity value, you may want to use a Collector or some of the other expert-level APIs.
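
If you go that route, a minimal sketch of such a Collector for the Lucene 4.x API could look like the following (the class name ScoreCollector is made up; it simply records the score of every matching document):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

public class ScoreCollector extends Collector {

    private final Map<Integer, Float> scores = new HashMap<Integer, Float>(); // global docId -> score
    private Scorer scorer;
    private int docBase;

    @Override
    public void setScorer(Scorer scorer) {
        this.scorer = scorer;
    }

    @Override
    public void collect(int doc) throws IOException {
        scores.put(docBase + doc, scorer.score()); // record the similarity score for this document
    }

    @Override
    public void setNextReader(AtomicReaderContext context) {
        this.docBase = context.docBase; // doc ids passed to collect() are relative to this segment
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true; // we only store scores, so collection order does not matter
    }

    public Map<Integer, Float> getScores() {
        return scores;
    }
}

You would then run a query with indexSearcher.search(query, collector) and read the per-document scores back via getScores().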

0 votes

You could store term vectors for the field. Then, for a particular document, you can get its term vector:

Terms termFreqVector = indexReader.getTermVector(doc, field);
TermsEnum te = termFreqVector.iterator(null);

and then, while iterating over the enum, for each term you can read:

long tf = te.totalTermFreq();                               // frequency of the term in this document
int df = indexReader.docFreq(new Term(field, te.term()));   // document frequency across the whole index

(Note that te.docFreq() would just return 1 here, because the term vector covers only this single document, so the document frequency has to come from the IndexReader.)

To obtain the idf, divide indexReader.numDocs() by the df and apply Math.log, i.e. idf = Math.log((double) indexReader.numDocs() / df).
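
Putting it together, a sketch of building a complete tf-idf vector for one document could look like this (continuing from the snippet above; tfIdfVector is just a name I picked):

Map<String, Double> tfIdfVector = new HashMap<String, Double>(); // term -> tf-idf weight
int numDocs = indexReader.numDocs();
BytesRef term;
while ((term = te.next()) != null) {
    long tf = te.totalTermFreq();                             // term frequency in this document
    int df = indexReader.docFreq(new Term(field, term));      // document frequency in the whole index
    double idf = Math.log((double) numDocs / df);
    tfIdfVector.put(term.utf8ToString(), tf * idf);
}

These maps can then serve as the feature vectors for the clustering step.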

Of course, you can also use Mahout's tools for vectorizing Lucene documents: http://mahout.apache.org/users/basics/creating-vectors-from-text.html