Word importance in Lucene index

Question

I need to find out how important a word is across the entire document collection that is indexed in a Lucene index. I want to extract some "representative words", let's say concepts that are common and can represent the whole collection, or collection "keywords". I did full-text indexing, and the only field I am using is the text contents, because the titles of the documents are mostly not representative (numbers, codes, etc.).

EDIT: I am reading the index, which contains maybe 60 documents:

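    // fReader is an open IndexReader; termEnum is assumed to come from fReader.terms()
    int numDocs = fReader.numDocs();
    TermEnum termEnum = fReader.terms();
    while (termEnum.next())
    {
        Term term = termEnum.term();
        double df = fReader.docFreq(term);   // number of documents containing the term

        TermDocs termDocs = fReader.termDocs(term);

        // HERE is what I mean when I say tfidf is per document
        while (termDocs.next())
        {
            double tf = termDocs.freq();     // occurrences of the term in the current document
            // Calculate tfidf.......
        }

        termDocs.close();
    }
    termEnum.close();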

So I will get the tfidf of this term, but once for every document that we loop through. I do not need these per-document results:

tfidf(term1, doc1);

tfidf(term1, doc2);

tfidf(term1, doc3); ...........and so on.
I need some measure of the importance of this term in the collection as a whole. Intuitively, it would be something like "if term1 has a good tfidf in 5 documents, then it is important".

But of course, something smarter :)

Thank you!!!

bajafresh4life · Accepted Answer · 2010-07-25T21:41:11

"So, if I calculate tfidf, it gives me the importance of a single term with respect to a single document."

Not true. IDF is measured globally across the entire corpus. The whole point of IDF is to provide a simple measure of exactly what you're looking for -- how "important" a term is.

So an easy way of doing what you ask is to find the most frequently occurring terms in the corpus, and weight them by document frequency.
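One possible reading of that suggestion, as a minimal sketch rather than a definitive implementation: count each term's total occurrences across the corpus, then multiply by an IDF-style weight derived from its document frequency. The class and method names below (CollectionTermScore, score, topTerms) are hypothetical, and the documents are plain in-memory token lists so the sketch runs without Lucene. With a real index, the same two counts would come from docFreq(term) and from summing termDocs.freq() over the loop in the question. The weight 1 + ln(N / (df + 1)) is an assumption chosen here, not something the answer prescribes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: scores terms at the collection level, not per document.
class CollectionTermScore {

    // Returns term -> (total occurrences in corpus) * (1 + ln(N / (df + 1))).
    public static Map<String, Double> score(List<List<String>> docs) {
        int numDocs = docs.size();
        Map<String, Integer> collectionFreq = new HashMap<>(); // occurrences over all docs
        Map<String, Integer> docFreq = new HashMap<>();        // number of docs containing the term

        for (List<String> doc : docs) {
            Set<String> seenInDoc = new HashSet<>();
            for (String term : doc) {
                collectionFreq.merge(term, 1, Integer::sum);
                if (seenInDoc.add(term)) {                     // count each doc once per term
                    docFreq.merge(term, 1, Integer::sum);
                }
            }
        }

        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : collectionFreq.entrySet()) {
            int df = docFreq.get(e.getKey());
            // Assumed weighting: rarer terms get a larger factor.
            double idf = 1.0 + Math.log(numDocs / (double) (df + 1));
            scores.put(e.getKey(), e.getValue() * idf);
        }
        return scores;
    }

    // Returns the k highest-scoring terms; ties are broken alphabetically.
    public static List<String> topTerms(List<List<String>> docs, int k) {
        Map<String, Double> scores = score(docs);
        List<String> terms = new ArrayList<>(scores.keySet());
        terms.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return terms.subList(0, Math.min(k, terms.size()));
    }
}
```

Calling CollectionTermScore.topTerms(docs, 20) on the tokenized contents of the 60 documents would give a candidate keyword list. Very common words tend to score high under this weighting, so filtering stop words first is probably worthwhile.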
