Lucene - getting document frequency - termsEnum.docFreq() always returns 1

Question

I am currently trying to calculate a tf-idf matrix for the terms in a Lucene index, using the following function:

public Table<Integer, BytesRef, Double> tfidf(String field) throws IOException, ParseException{
    //variables in complete context
    int totalNoOfDocs = reader.numDocs();                                   //total no of docs
    HashBasedTable<Integer, BytesRef, Double> tfidfPerDocAndTerm = HashBasedTable.create(); //tfidf value for each document(integer) and term(Byteref) pair.

    //variables in loop context
    BytesRef    term;                                                       //term as BytesRef
    int         noOfDocs;                                                   //number of documents (a term occurs in)
    int         tf;                                                         //term frequency (of a term in a doc)
    double      idf;                                                        //inverse document frequency (of a term in a doc)
    double      tfidf;                                                      //term frequency - inverse document frequency value (of a term in a doc)
    Terms       termVector;                                                 //all terms of current doc in current field
    TermsEnum   termsEnum;                                                  //iterator for terms
    DocsEnum    docsEnum;                                                   //iterator for documents (of current term)

    List<Integer> docIds = getDocIds(totalNoOfDocs);                        //get internal documentIds of documents

    try {
        for(int doc : docIds){
            termVector  = reader.getTermVector(doc, field);                 //get termvector for document
            termsEnum   = termVector.iterator(null);                        //get iterator of termvector to iterate over terms


            while((term = termsEnum.next()) != null){                       //iterate over terms

                    noOfDocs = termsEnum.docFreq();                         //add no of docs the term occurs in to list

                    docsEnum = termsEnum.docs(null, null);                  //get document iterator for this term (all documents the term occurs in)
                    while((doc = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS){ //iterate over documents - computation of all tf-idf values for this term
                        tf      = docsEnum.freq();                          //get termfrequency of current term in current doc
                        idf     = Math.log((double)totalNoOfDocs / (double)noOfDocs); //calculate idf
                        tfidf   = (double) tf * idf;                        //calculate tfidf
                        tfidfPerDocAndTerm.put(doc, term, tfidf);           //add tf-idf value to matrix

                    }
            }
        }

    } catch (IOException ex) {
        Logger.getLogger(Index.class.getName()).log(Level.SEVERE, null, ex);
    }   
    return tfidfPerDocAndTerm;
}

The problem is that noOfDocs = termsEnum.docFreq(); always returns 1, even though there are obviously terms that occur in more than one document (I checked this manually by printing "term").

I also found that the docsEnum I retrieve with docsEnum = termsEnum.docs(null, null); always contains only one document (doc 0).

When creating the index I used a standard analyzer with a stop word list, so all terms are lowercased.

So what is my problem?

Thanks for your help!

You should first use Luke to see if the index looks like it should. — Daniel Naber
I do this on a RAMDirectory index, but I switched to a "real" directory so that I could open it in Luke. Luke says "Format version is not supported..." when trying to open the index. (I am using Lucene 4.0.) — dburgmann
Download the newest version code.google.com/p/luke/downloads/… — Rob Audenaerde
I did use lukeall-4.0.0-ALPHA.jar. That seems to be the newest version. — dburgmann

Aries Aries · Accepted Answer · 2013-10-30T06:19:57

Actually, your loop iterates over your term, which is of type BytesRef, rather than over your TermsEnum. Unfortunately, BytesRef does not provide a method called freq() or docFreq().
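
A further point that may explain the symptoms: Lucene documents a term vector as a single-document inverted index, so a TermsEnum obtained from reader.getTermVector(doc, field) only ever sees that one document. That would account for docFreq() returning 1 and for the DocsEnum containing only doc 0. In Lucene 4.x the index-wide count should be available through reader.docFreq(new Term(field, term)). The sketch below is plain Java rather than Lucene code. It models each term vector as a map from term to in-document frequency and derives document frequencies across all documents before applying the same idf formula as the question. The class name TfIdfSketch, its method names, and the sample data are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdfSketch {

    /** Counts, for every term, how many documents contain it (index-wide document frequency). */
    public static Map<String, Integer> documentFrequencies(List<Map<String, Integer>> termVectors) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> vector : termVectors) {
            // A single vector only knows its own document, so each term counts once per vector.
            for (String term : vector.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** tf-idf with idf = ln(totalDocs / docFreq), the formula used in the question. */
    public static double tfidf(int tf, int totalDocs, int docFreq) {
        return tf * Math.log((double) totalDocs / docFreq);
    }

    public static void main(String[] args) {
        // Hypothetical per-document term vectors: term -> frequency within that document.
        List<Map<String, Integer>> termVectors = List.of(
                Map.of("lucene", 2, "index", 1),
                Map.of("lucene", 1, "query", 3),
                Map.of("query", 1));

        Map<String, Integer> df = documentFrequencies(termVectors);
        int totalDocs = termVectors.size();

        for (int doc = 0; doc < totalDocs; doc++) {
            for (Map.Entry<String, Integer> e : termVectors.get(doc).entrySet()) {
                double value = tfidf(e.getValue(), totalDocs, df.get(e.getKey()));
                System.out.println("doc " + doc + " term " + e.getKey() + " tf-idf " + value);
            }
        }
    }
}
```

When adapting this to the original function, note that the BytesRef returned by termsEnum.next() may be reused by the enumerator, so copying it (for example with BytesRef.deepCopyOf(term)) before using it as a table key is likely necessary.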
