
I want to compute TF-IDF scores, normalized by the field norm, for each term in the field COMBINED_FIELD of the documents found via Lucene. As you can see in the code below, I am able to get the term frequency of each term in a document's field, and I can also get the document frequency, but I cannot find a way to get the norm of this field at query time. All approaches I have found so far rely on methods that only exist in older Lucene versions and are gone in Lucene 6. Using a LeafReader might be the way to go, but I did not find a way to obtain an instance of one.

Do you have an idea how I could get the norm of the field COMBINED_FIELD for each document?

Or can I use termVector.size() as a replacement for the field length? Does size() count the number of occurrences of each term, or is every term counted only once?

Thanks in advance!

IndexSearcher iSearcher = null;
ScoreDoc[] docs = null;
try {
   iSearcher = this.searchManager.acquire();
   IndexReader reader = iSearcher.getIndexReader();

   MultiFieldQueryParser parser = new MultiFieldQueryParser(this.getSearchFields(), this.queryAnalyzer);

   parser.setDefaultOperator(QueryParser.Operator.OR);

   Query query = parser.parse(QueryParser.escape(searchString));            

   docs = iSearcher.search(query, maxSearchResultNumber).scoreDocs;     

   for(int i=0; i < docs.length; i++) {
      // requires term vectors to have been stored for COMBINED_FIELD at indexing time
      Terms termVector = reader.getTermVector(docs[i].doc, COMBINED_FIELD);

      TermsEnum itr = termVector.iterator();
      BytesRef term = null;
      PostingsEnum postings = null;

      while((term = itr.next()) != null){
         String termText = term.utf8ToString();
         postings = itr.postings(postings, PostingsEnum.FREQS);
         postings.nextDoc(); // a term vector's postings contain only the current document

         int tf = postings.freq(); // term frequency within this document's field
         int docFreq = reader.docFreq(new Term(COMBINED_FIELD, term)); // number of documents containing the term
         //HERE I WANT TO GET THE FIELD LENGTH OF THE CURRENT DOCUMENT
      }
   }
} catch (Exception e) {
   // TODO Auto-generated catch block
   e.printStackTrace();         
} finally {
   try {
      this.searchManager.release(iSearcher);
   } catch (IOException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
   }
}

Alternatively, is there a way to get the TF-IDF or BM25 value for each term of a field directly from Lucene?

So, do you want to compute the norm, or get the length? – Mysterion
Actually, I want the norm that Lucene normally calculates during indexing, as far as I know. However, if that is not possible, I would use the length as a proxy and compute my own norm as 1/sqrt(length). – Hans Blafoo

1 Answer


Lucene internally computes the norms during indexing in the method org.apache.lucene.search.similarities.Similarity#computeNorm, then encodes them and stores them on disk in the .nvm files. Later, during querying/scoring, they are only decoded.
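
Since the encoded norms end up in the index, they can also be read back at query time through a LeafReader, which you can get from IndexReader.leaves(). A minimal sketch of what that might look like in Lucene 6 (docId is assumed to be a top-level id such as docs[i].doc from your search results; the decoding step assumes the index was written with the default similarity, which packs boost / sqrt(fieldLength) into a single byte via SmallFloat.floatToByte315, so the result is only an approximation):

for (LeafReaderContext leafCtx : reader.leaves()) {
   LeafReader leafReader = leafCtx.reader();
   int localDocId = docId - leafCtx.docBase;         // convert the top-level id to a segment-local id
   if (localDocId < 0 || localDocId >= leafReader.maxDoc()) {
      continue;                                       // the document lives in another segment
   }
   NumericDocValues norms = leafReader.getNormValues(COMBINED_FIELD);
   if (norms != null) {
      long encoded = norms.get(localDocId);           // Lucene 6 still offers random-access norms
      // Assumption: default one-byte SmallFloat encoding of boost / sqrt(fieldLength)
      float decoded = SmallFloat.byte315ToFloat((byte) encoded);
   }
}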

I think one possible way to do this programmatically in Lucene is to extend the Similarity class, capture this information during indexing, and store it somewhere. That does not sound to me like the best way to go, but it is at least something.
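
For illustration, here is a rough sketch of such a delegating Similarity for Lucene 6 (the class name and the map are made up for this example, and computeNorm does not see a document id, so correlating the recorded lengths with concrete documents is left open):

public class LengthRecordingSimilarity extends Similarity {

   private final Similarity delegate;
   // hypothetical storage: last length seen per field name at indexing time
   private final Map<String, Integer> lastSeenLength = new ConcurrentHashMap<>();

   public LengthRecordingSimilarity(Similarity delegate) {
      this.delegate = delegate;
   }

   @Override
   public long computeNorm(FieldInvertState state) {
      // same length definition as BM25Similarity with discountOverlaps enabled
      int length = state.getLength() - state.getNumOverlap();
      lastSeenLength.put(state.getName(), length);   // "store it somewhere"
      return delegate.computeNorm(state);
   }

   @Override
   public SimWeight computeWeight(CollectionStatistics collectionStats, TermStatistics... termStats) {
      return delegate.computeWeight(collectionStats, termStats);
   }

   @Override
   public SimScorer simScorer(SimWeight weight, LeafReaderContext context) throws IOException {
      return delegate.simScorer(weight, context);
   }
}

You would register it at indexing time, e.g. via indexWriterConfig.setSimilarity(new LengthRecordingSimilarity(new BM25Similarity())).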

On the other hand, BM25Similarity computes the length this way:

discountOverlaps ? state.getLength() - state.getNumOverlap() : state.getLength();

where getLength() is the number of terms in the field, which you could also compute yourself by iterating over the term vector in a while loop, as you already do in your example.
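
For example, inside the loop from your question, the field length (and the 1/sqrt(length) proxy norm mentioned in the comments) could be derived from the term vector roughly like this, assuming term vectors are stored for COMBINED_FIELD and ignoring the overlap discount:

Terms termVector = reader.getTermVector(docs[i].doc, COMBINED_FIELD);
TermsEnum itr = termVector.iterator();
long fieldLength = 0;
BytesRef term;
while ((term = itr.next()) != null) {
   // for a term vector enum, totalTermFreq() is the term's frequency within this document
   fieldLength += itr.totalTermFreq();
}
// the 1/sqrt(length) proxy norm suggested in the comments above
float approxNorm = (float) (1.0 / Math.sqrt(fieldLength));

Note that termVector.size(), by contrast, reports the number of unique terms in the field, i.e. each term is counted only once, so it is not a substitute for the field length.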