2
votes

I need to measure how important a word is across the entire document collection indexed in a Lucene index. I want to extract some "representative words": concepts that are common and can represent the whole collection, i.e. collection "keywords". I did full-text indexing, and the only field I am using is the text content, because the document titles are mostly not representative (numbers, codes, etc.).

EDIT: I am reading the index which contains maybe 60 documents....

    int numDocs = fReader.numDocs();
    while (termEnum.next())
    {
        Term term = termEnum.term();
        // number of documents in the index that contain this term
        double df = fReader.docFreq(term);

        // use the same reader that supplied df and numDocs
        TermDocs termDocs = fReader.termDocs(term);

        // HERE is what I mean when I say tfidf is per document:
        while (termDocs.next())
        {
            // frequency of the term within the current document
            double tf = termDocs.freq();
            // Calculate tfidf.......
        }

        termDocs.close();
    }

So I get the tfidf of this term, but separately for every document the loop visits. These are not the results I need:

tfidf(term1, doc1);

tfidf(term1, doc2);

tfidf(term1, doc3); ...........and so on.
I need some measure of the importance of this term in the collection as a whole. Intuitively, it would be something like "if the term "term1" has a good tfidf in 5 documents, then it is important".

But of course, something smarter :)

Thank you!!!

4 Answers

1
votes

So, if I calculate tfidf, it gives me the importance of a single term with respect to a single document.

Not true. IDF is measured globally across the entire corpus. The whole point of IDF is to provide a simple measure of exactly what you're looking for -- how "important" a term is.

So an easy way of doing what you ask is to find the most frequently occurring terms in the corpus, and weight them by document frequency.
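One way to read that suggestion, as a rough sketch only: count how often each term occurs across the whole corpus and multiply that by the term's IDF, which is the same as summing the per-document tf*idf values for that term. The class name `CollectionTermScore`, the `score` method, the tokenized-list input and the `log(N / df)` form of IDF are all assumptions made for illustration; nothing here is Lucene API.

```java
import java.util.*;

// Hypothetical helper, not part of Lucene: scores each term for the whole collection.
class CollectionTermScore {

    // Each document is given as a list of already-tokenized terms (an assumption).
    static Map<String, Double> score(List<List<String>> docs) {
        Map<String, Integer> collectionFreq = new HashMap<>(); // total occurrences in the corpus
        Map<String, Integer> docFreq = new HashMap<>();        // number of documents containing the term

        for (List<String> doc : docs) {
            Set<String> seenInDoc = new HashSet<>();
            for (String token : doc) {
                collectionFreq.merge(token, 1, Integer::sum);
                if (seenInDoc.add(token)) {
                    docFreq.merge(token, 1, Integer::sum);
                }
            }
        }

        int numDocs = docs.size();
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : collectionFreq.entrySet()) {
            // Assumed IDF variant: log(N / df). A term present in every document scores 0.
            double idf = Math.log((double) numDocs / docFreq.get(e.getKey()));
            scores.put(e.getKey(), e.getValue() * idf);
        }
        return scores;
    }
}
```

Sorting the returned map by value in descending order gives a candidate keyword list. Note that with this IDF variant a term that appears in every document gets a score of 0, so a smoothed IDF may suit a small collection better.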

0
votes

You can try opening the index using Luke and it gives you the top-ranked terms.

0
votes

EDIT: I still do not get what you are trying to achieve. A high TF/IDF value means that this term is useful for differentiating this document from the rest of the collection, that is: this term is relatively more frequent in the specific document than in the collection in general. Therefore it "represents" the document against the collection background. Is this what you want?

One possible way to rephrase your question is that you want to compress the collection, using a few high-frequency terms. This means words that appear a lot in the collection, and can be done by taking words that have a low idf.

Another alternative is that you want some concise way to represent the collection against a more general background, say a larger collection or the whole WWW. In that case, you want to compare word frequency between collections, consider the mutual information between the word type and the collection, or other feature selection methods.
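To make the "compare word frequency between collections" idea concrete, here is a minimal sketch that scores each term of your collection by the log-ratio of its relative frequency there versus in a background collection. The class `CollectionContrast`, its methods, the flat token-list input and the add-one smoothing are assumptions chosen for illustration; mutual information or another feature selection method would follow the same pattern with a different formula.

```java
import java.util.*;

// Hypothetical helper: contrasts term frequencies in a collection against a background.
class CollectionContrast {

    // Counts how often each token occurs in a flat token list.
    static Map<String, Integer> count(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    // Returns log(p_collection(term) / p_background(term)) for every term in the collection.
    // Add-one smoothing (an assumption) keeps terms absent from the background finite.
    static Map<String, Double> logRatio(List<String> collection, List<String> background) {
        Map<String, Integer> fg = count(collection);
        Map<String, Integer> bg = count(background);

        Set<String> vocabulary = new HashSet<>(fg.keySet());
        vocabulary.addAll(bg.keySet());
        int v = vocabulary.size();

        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : fg.entrySet()) {
            double pFg = (e.getValue() + 1.0) / (collection.size() + v);
            double pBg = (bg.getOrDefault(e.getKey(), 0) + 1.0) / (background.size() + v);
            scores.put(e.getKey(), Math.log(pFg / pBg));
        }
        return scores;
    }
}
```

Terms with a clearly positive score are over-represented in your collection relative to the background, which makes them candidates for collection "keywords"; terms with a score near or below zero are just as common elsewhere.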

If I still miss your point, please say so.

0
votes

The contrib/ folder has a class to generate a list of the most frequent terms: http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/contrib/misc/src/java/org/apache/lucene/misc/HighFreqTerms.java

If you're instead looking for semantic feature extraction, you can check out http://project.carrot2.org/