lucene 4.10.2 calculate tf-idf for all terms in index

Question

I would like to calculate the term frequency and inverse document frequency (tf-idf) for all terms in an index.

I couldn't find any example of how to do it with the latest Lucene (4.x.x).

Could you help me?

I will use Lucene to index XML documents (content only). I would like to calculate tf-idf values and use them to classify a document collection with a Kohonen network. - tommy

femtoRgon · Accepted Answer · 2015-01-30T22:21:58

To iterate through the terms in the index, you'll want to use Fields and Terms. The TermsEnum obtained from Terms exposes docFreq() for your idf calculation, and IndexReader itself exposes numDocs(). You can use DefaultSimilarity.idf to perform the calculation for you, rather than rolling your own.

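import org.apache.lucene.index.Fields;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// reader is an already opened IndexReader over the index
DefaultSimilarity similarity = new DefaultSimilarity();
int numDocs = reader.numDocs();
Fields fields = MultiFields.getFields(reader); // may be null for an empty index
for (String field : fields) {
    Terms terms = fields.terms(field);
    if (terms == null) continue; // no indexed terms for this field
    TermsEnum termsEnum = terms.iterator(null);
    while (termsEnum.next() != null) {
        // idf depends only on the term's document frequency and the index size
        float idf = similarity.idf(termsEnum.docFreq(), numDocs);
        System.out.println(field + ":" + termsEnum.term().utf8ToString() + " idf=" + idf);
    }
}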

tf is only meaningful for a term with regard to a specific document, so I'm not quite sure what you are looking for there.
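If the goal is one tf-idf vector per document (for example as input to a Kohonen network), a rough sketch of the arithmetic may help. It is plain Java and does not touch a Lucene index: it assumes the documents are already tokenized into lists of strings. TfIdfSketch and its methods are hypothetical names made up for illustration; the tf and idf formulas are the ones DefaultSimilarity uses in Lucene 4.x (sqrt(freq) and 1 + ln(numDocs / (docFreq + 1))).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of Lucene: tf-idf weights for tokenized documents. */
final class TfIdfSketch {

    /** Same formula as DefaultSimilarity.idf in Lucene 4.x. */
    static double idf(long docFreq, long numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** Same formula as DefaultSimilarity.tf in Lucene 4.x. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Returns one term -> tf-idf map per document, in input order. */
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: the number of documents that contain each term.
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw count of each term within this one document.
            Map<String, Integer> freq = new HashMap<>();
            for (String term : doc) {
                freq.merge(term, 1, Integer::sum);
            }
            // Weight = tf (per document) * idf (per index).
            Map<String, Double> weights = new TreeMap<>();
            for (Map.Entry<String, Integer> e : freq.entrySet()) {
                double w = tf(e.getValue()) * idf(docFreq.get(e.getKey()), docs.size());
                weights.put(e.getKey(), w);
            }
            result.add(weights);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("lucene", "index", "lucene"),
            Arrays.asList("index", "terms"));
        List<Map<String, Double>> vectors = tfIdf(docs);
        for (int i = 0; i < vectors.size(); i++) {
            System.out.println("doc " + i + ": " + vectors.get(i));
        }
    }
}
```

The idf part is the same for every document, while tf has to be recomputed per document, which is why the result is one map per document rather than one number per term.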
