I would like to calculate the term frequency and the inverse document frequency (tf-idf) for all terms in an index, but I couldn't find any example of how to do it with the latest Lucene (4.x.x).
Could you help me?
To iterate through the terms in the index, you'll want to use Fields and Terms. The TermsEnum you get from Terms exposes docFreq() for your idf calculation, and IndexReader itself exposes numDocs(). You can use DefaultSimilarity.idf to perform the calculation for you, rather than rolling your own.
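import org.apache.lucene.index.Fields;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// reader is assumed to be an already opened IndexReader
DefaultSimilarity similarity = new DefaultSimilarity();
int docnum = reader.numDocs();
Fields fields = MultiFields.getFields(reader);
for (String field : fields) {
    Terms terms = fields.terms(field);
    TermsEnum termsEnum = terms.iterator(null);
    // next() returns null once the terms of this field are exhausted
    while (termsEnum.next() != null) {
        double idf = similarity.idf(termsEnum.docFreq(), docnum);
        System.out.println(field + ":" + termsEnum.term().utf8ToString() + " idf=" + idf);
    }
}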
tf is only defined for a term within a specific document, so I'm not quite sure what you are looking for there.
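To make the point about tf concrete, here is a minimal sketch in plain Java that does not use Lucene at all. The class name TfSketch, the method names, and the tiny three-document corpus are all made up for illustration. It only shows that tf needs a (term, document) pair, while the document frequency needs the term alone.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfSketch {
    // Raw term frequency: how often each term occurs in ONE document.
    static Map<String, Integer> termFrequencies(List<String> docTokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : docTokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    // Document frequency: in how many documents the term occurs at least once.
    static int docFreq(List<List<String>> corpus, String term) {
        int df = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                df++;
            }
        }
        return df;
    }

    // tf-idf of a term in one document: tf * log(N / df).
    // Returns 0 when the term occurs in no document at all.
    static double tfIdf(List<List<String>> corpus, int docIndex, String term) {
        int df = docFreq(corpus, term);
        if (df == 0) {
            return 0.0;
        }
        int tf = termFrequencies(corpus.get(docIndex)).getOrDefault(term, 0);
        return tf * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        // Hypothetical, already tokenized corpus.
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("lucene", "index", "lucene"),
                Arrays.asList("index", "search"),
                Arrays.asList("search", "query"));
        System.out.println("tf(lucene, doc 0) = " + termFrequencies(corpus.get(0)).get("lucene"));
        System.out.println("df(index) = " + docFreq(corpus, "index"));
        System.out.println("tf-idf(lucene, doc 0) = " + tfIdf(corpus, 0, "lucene"));
    }
}
```

The same term gets a different tf in every document, which is why a single per-term tf value for the whole index is not well defined.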
for (String field : fields)
{
    if (field.equals("contents"))
    {
        Terms terms = fields.terms(field);
        TermsEnum termsEnum = terms.iterator(null);
        while (termsEnum.next() != null)
        {
            // double idf = similarity.idf(termsEnum.docFreq(), docnum);
            // Cast to double: int / int would truncate the ratio before the log is taken.
            double idf = Math.log((double) docnum / termsEnum.docFreq()); // idf = log(D/dt)
            System.out.println(field + ":" + termsEnum.term().utf8ToString() + " fr=" + termsEnum.docFreq() + " idf=" + idf);
        }
    }
    else
    {
        System.out.println("fin");
    }
}
because idf(t, D) = log(N / |{d in D : t in d}|), where
N: total number of documents in the corpus
|{d in D : t in d}|: number of documents in which the term t appears
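One thing to watch when coding this formula in Java: numDocs() and docFreq() both return int, and dividing two ints truncates the ratio before the logarithm is applied. The sketch below is plain Java with made-up names (IdfSketch, idf, idfTruncated) and hypothetical counts. It compares the two variants so the difference is easy to see.

```java
public class IdfSketch {
    // idf(t, D) = log(N / df), with the ratio computed in floating point.
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    // Same formula, but with int division: the ratio is truncated first.
    static double idfTruncated(int numDocs, int docFreq) {
        return Math.log(numDocs / docFreq);
    }

    public static void main(String[] args) {
        int numDocs = 10; // hypothetical N
        int docFreq = 3;  // hypothetical number of documents containing the term
        System.out.println("floating point: " + idf(numDocs, docFreq));
        System.out.println("int division:   " + idfTruncated(numDocs, docFreq));
    }
}
```

With these counts the truncated version takes log(3) instead of log(10/3), so it underestimates the idf. Also, as far as I know, DefaultSimilarity.idf uses a smoothed variant, 1 + log(numDocs / (docFreq + 1)), so its values will not match this textbook formula exactly.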