
I would like to implement a clustering algorithm on top of Lucene. For that, I need the tf-idf term vector that represents each document, so that I can represent the centroids the same way the documents are represented, compute the similarity between documents and clusters, and update the centroids by recalculating their feature values. But how can I do that on top of Lucene?

Can I even get tf-idf values out of Lucene?

I know that the term frequency of each term in each document is stored, but does that mean I would need to calculate the idf 'manually' for each term? And how would I then build the vectors to use for clustering?
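
To make it more concrete, the kind of representation and comparison I have in mind is roughly this (just a sketch, e.g. cosine similarity, and all the names are mine):

// Each document and each centroid as a sparse vector: term -> tf-idf weight.
Map<String, Double> docVector = new HashMap<String, Double>();
Map<String, Double> centroid = new HashMap<String, Double>();

// Cosine similarity between a document vector and a centroid.
double dot = 0, docNorm = 0, centroidNorm = 0;
for (Map.Entry<String, Double> e : docVector.entrySet()) {
    Double w = centroid.get(e.getKey());
    if (w != null) {
        dot += e.getValue() * w;
    }
    docNorm += e.getValue() * e.getValue();
}
for (double w : centroid.values()) {
    centroidNorm += w * w;
}
double similarity = dot / (Math.sqrt(docNorm) * Math.sqrt(centroidNorm));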

Thanks


2 Answers

0 votes

Note that Lucene uses a variation of the TF-IDF formula you would find in textbooks.

You can see details here:

http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
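
For reference, the practical scoring function described on that page is, roughly:

score(q,d) = coord(q,d) * queryNorm(q) * sum over t in q of ( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) )

so tf and idf both appear, but they are combined with query and field normalization factors rather than used as a plain tf*idf weight.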

In particular, only the terms used in the query are accessed. This is done mainly for performance: Lucene reads as little data from the index as possible.

If you want access to the exact similarity value, you may want to use a Collector or some of the other expert-level APIs.
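
If you go that route, a minimal sketch of such a Collector for the Lucene 4.x API could look like the following (the class name ScoreCollector is made up; it simply records the score of every matching document):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

public class ScoreCollector extends Collector {

    private final Map<Integer, Float> scores = new HashMap<Integer, Float>(); // global docId -> score
    private Scorer scorer;
    private int docBase;

    @Override
    public void setScorer(Scorer scorer) {
        this.scorer = scorer;
    }

    @Override
    public void collect(int doc) throws IOException {
        scores.put(docBase + doc, scorer.score()); // record the similarity score for this document
    }

    @Override
    public void setNextReader(AtomicReaderContext context) {
        this.docBase = context.docBase; // doc ids passed to collect() are relative to this segment
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true; // we only store scores, so collection order does not matter
    }

    public Map<Integer, Float> getScores() {
        return scores;
    }
}

You would then run a query with indexSearcher.search(query, collector) and read the per-document scores back via getScores().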

0 votes

You could store term vectors for the field. Then, for a particular document, you can get its term vector:

Terms termFreqVector = indexReader.getTermVector(doc, field);
TermsEnum te = termFreqVector.iterator(null);

and then, while iterating over the enum, for each term you can read:

long tf = te.totalTermFreq();                               // frequency of the term in this document
int df = indexReader.docFreq(new Term(field, te.term()));   // document frequency across the whole index

(Note that te.docFreq() would just return 1 here, because the term vector covers only this single document, so the document frequency has to come from the IndexReader.)

To obtain the idf, divide indexReader.numDocs() by the df and apply Math.log, i.e. idf = Math.log((double) indexReader.numDocs() / df).
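
Putting it together, a sketch of building a complete tf-idf vector for one document could look like this (continuing from the snippet above; tfIdfVector is just a name I picked):

Map<String, Double> tfIdfVector = new HashMap<String, Double>(); // term -> tf-idf weight
int numDocs = indexReader.numDocs();
BytesRef term;
while ((term = te.next()) != null) {
    long tf = te.totalTermFreq();                             // term frequency in this document
    int df = indexReader.docFreq(new Term(field, term));      // document frequency in the whole index
    double idf = Math.log((double) numDocs / df);
    tfIdfVector.put(term.utf8ToString(), tf * idf);
}

These maps can then serve as the feature vectors for the clustering step.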

Of course, you can also use Mahout's tools for vectorizing Lucene documents: http://mahout.apache.org/users/basics/creating-vectors-from-text.html