2
votes

I would like to calculate the term frequency and the inverse document frequency (tf-idf) for all terms in an index.

I couldn't find any example of how to do this with the latest Lucene (4.x.x).

Could you help me?

2
I will use Lucene to index XML documents (content only). I would like to calculate tf-idf values and use them to classify a collection of documents with a Kohonen network. – tommy

2 Answers

2
votes

To iterate through the terms in the index, use Fields and Terms. The TermsEnum you get from Terms exposes docFreq() for your idf calculation, and IndexReader itself exposes numDocs(). Rather than rolling your own formula, you can use DefaultSimilarity.idf to perform the calculation for you.

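import org.apache.lucene.index.Fields;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// reader is assumed to be an already-open IndexReader
DefaultSimilarity similarity = new DefaultSimilarity();
int docnum = reader.numDocs();
Fields fields = MultiFields.getFields(reader);
for (String field : fields) {
    Terms terms = fields.terms(field);
    if (terms == null) continue; // this field has no terms
    TermsEnum termsEnum = terms.iterator(null);
    while (termsEnum.next() != null) {
        // docFreq() is the number of documents containing the current term
        double idf = similarity.idf(termsEnum.docFreq(), docnum);
        System.out.println("" + field + ":" + termsEnum.term().utf8ToString() + " idf=" + idf);
    }
}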

tf is only meaningful for a term with respect to a specific document, so I'm not quite sure what you are looking for there.
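To make the per-document nature of tf concrete, here is a minimal sketch on a toy in-memory corpus. It does not use any Lucene classes: TfIdfSketch, its methods and the sample documents are all made up for illustration. The weighting (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1))) is what I understand DefaultSimilarity to apply in Lucene 4.x, so treat that as an assumption and check it against the version you use.

```java
import java.util.Arrays;
import java.util.List;

/** Toy, in-memory tf-idf; hypothetical helper, no Lucene classes involved. */
final class TfIdfSketch {

    /** Raw term frequency: how often the term occurs in ONE document. */
    static int termFrequency(List<String> docTokens, String term) {
        int count = 0;
        for (String token : docTokens) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    /** Document frequency: number of documents containing the term at least once. */
    static int docFrequency(List<List<String>> corpus, String term) {
        int count = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                count++;
            }
        }
        return count;
    }

    /** Assumed weighting: tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)). */
    static double tfIdf(List<List<String>> corpus, int docId, String term) {
        double tf = Math.sqrt(termFrequency(corpus.get(docId), term));
        double idf = 1.0 + Math.log(corpus.size() / (double) (docFrequency(corpus, term) + 1));
        return tf * idf;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("lucene", "index", "lucene"),
            Arrays.asList("index", "reader"),
            Arrays.asList("kohonen", "network"));
        // The same term gets a different weight in each document.
        for (int docId = 0; docId < corpus.size(); docId++) {
            System.out.println("doc " + docId + " lucene tf-idf=" + tfIdf(corpus, docId, "lucene"));
        }
    }
}
```

The idf part depends only on the term and the corpus, while the tf part changes from document to document, which is why a single "tf for all terms in the index" is not well defined.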

0
votes

// Assumes reader, fields and docnum are set up as in the answer above.
for (String field : fields) {
    if (field.equals("contents")) {
        Terms terms = fields.terms(field);
        TermsEnum termsEnum = terms.iterator(null);
        while (termsEnum.next() != null) {
            // double idf = similarity.idf(termsEnum.docFreq(), docnum);

            // Cast to double: int / int would truncate before the log is taken.
            double idf = Math.log((double) docnum / termsEnum.docFreq()); // idf = log(D/dt)

            System.out.println("" + field + ":" + termsEnum.term().utf8ToString() + " fr = " + termsEnum.docFreq() + " idf=" + idf);
        }
    } else {
        System.out.println("fin");
    }
}

because idf(t, D) = log(N / |{d in D : t in d}|)

N: total number of documents in the corpus

|{d in D : t in d}|: number of documents in which the term t appears
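As a small worked example of this formula, the sketch below computes it for made-up counts. PlainIdf and the numbers are hypothetical. One thing to watch in Java: dividing two int values truncates before the logarithm is taken, so the sketch casts to double first.

```java
/** Plain idf = ln(N / df); hypothetical helper for illustration only. */
final class PlainIdf {

    /** The cast forces floating-point division, so 10 / 4 is 2.5 rather than 2. */
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        int numDocs = 10;
        int docFreq = 4;
        // Integer division first: this is ln(2), not ln(2.5).
        System.out.println("int division:    " + Math.log(numDocs / docFreq));
        // Floating-point division: ln(2.5).
        System.out.println("double division: " + idf(numDocs, docFreq));
    }
}
```

A term that appears in every document gets idf = ln(1) = 0 under this formula, and docFreq is never 0 for a term that was read from the index, so the division is safe in that setting.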