
I want to compute TF-IDF scores, normalized by the field norm, for each term in the field COMBINED_FIELD of the documents found via Lucene. As you can see in the code below, I am able to get the term frequency of each term in a document's field, and I can also get the document frequency, but I cannot find a way to get the norm of this field at query time. All approaches I have found so far rely on methods that only exist in older Lucene versions and are gone in Lucene 6. Using a LeafReader might be the way to go, but I did not find a way to obtain an instance of one.

Do you have an idea how I could get the norm of the field COMBINED_FIELD for each document?

Or can I use termVector.size() as a replacement for the field length? Does size() count the number of occurrences of each term, or is every term counted only once?

Thanks in advance!

IndexSearcher iSearcher = null;
ScoreDoc[] docs = null;
try {
   iSearcher = this.searchManager.acquire();
   IndexReader reader = iSearcher.getIndexReader();

   MultiFieldQueryParser parser = new MultiFieldQueryParser(this.getSearchFields(), this.queryAnalyzer);

   parser.setDefaultOperator(QueryParser.Operator.OR);

   Query query = parser.parse(QueryParser.escape(searchString));            

   docs = iSearcher.search(query, maxSearchResultNumber).scoreDocs;     

   for(int i=0; i < docs.length; i++) {
      // requires term vectors to have been stored for COMBINED_FIELD at indexing time
      Terms termVector = reader.getTermVector(docs[i].doc, COMBINED_FIELD);

      TermsEnum itr = termVector.iterator();
      BytesRef term = null;
      PostingsEnum postings = null;

      while((term = itr.next()) != null){
         String termText = term.utf8ToString();
         postings = itr.postings(postings, PostingsEnum.FREQS);
         postings.nextDoc(); // a term vector's postings contain only the current document

         int tf = postings.freq(); // term frequency within this document's field
         int docFreq = reader.docFreq(new Term(COMBINED_FIELD, term)); // number of documents containing the term
         //HERE I WANT TO GET THE FIELD LENGTH OF THE CURRENT DOCUMENT
      }
   }
} catch (Exception e) {
   // TODO Auto-generated catch block
   e.printStackTrace();         
} finally {
   try {
      this.searchManager.release(iSearcher);
   } catch (IOException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
   }
}

Alternatively, is there a way to get the TF-IDF or BM25 value for each term of a field directly from Lucene?

So, do you want to compute the norm, or get the length? – Mysterion
Actually, I want the norm that Lucene normally calculates during indexing, as far as I know. However, if that is not possible, I would use the length as a proxy and compute my own norm as 1/sqrt(length). – Hans Blafoo

1 Answer


Lucene internally computes the norms during indexing in the method org.apache.lucene.search.similarities.Similarity#computeNorm, then encodes them and stores them on disk in the .nvm files. Later, during querying/scoring, they are only decoded.
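
Since the encoded norms end up in the index, they can also be read back at query time through a LeafReader, which you can get from IndexReader.leaves(). A minimal sketch of what that might look like in Lucene 6 (docId is assumed to be a top-level id such as docs[i].doc from your search results; the decoding step assumes the index was written with the default similarity, which packs boost / sqrt(fieldLength) into a single byte via SmallFloat.floatToByte315, so the result is only an approximation):

for (LeafReaderContext leafCtx : reader.leaves()) {
   LeafReader leafReader = leafCtx.reader();
   int localDocId = docId - leafCtx.docBase;         // convert the top-level id to a segment-local id
   if (localDocId < 0 || localDocId >= leafReader.maxDoc()) {
      continue;                                       // the document lives in another segment
   }
   NumericDocValues norms = leafReader.getNormValues(COMBINED_FIELD);
   if (norms != null) {
      long encoded = norms.get(localDocId);           // Lucene 6 still offers random-access norms
      // Assumption: default one-byte SmallFloat encoding of boost / sqrt(fieldLength)
      float decoded = SmallFloat.byte315ToFloat((byte) encoded);
   }
}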

I think one possible way to do this programmatically in Lucene is to extend the Similarity class, capture this information during indexing, and store it somewhere. That does not sound to me like the best way to go, but it is at least something.
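
For illustration, here is a rough sketch of such a delegating Similarity for Lucene 6 (the class name and the map are made up for this example, and computeNorm does not see a document id, so correlating the recorded lengths with concrete documents is left open):

public class LengthRecordingSimilarity extends Similarity {

   private final Similarity delegate;
   // hypothetical storage: last length seen per field name at indexing time
   private final Map<String, Integer> lastSeenLength = new ConcurrentHashMap<>();

   public LengthRecordingSimilarity(Similarity delegate) {
      this.delegate = delegate;
   }

   @Override
   public long computeNorm(FieldInvertState state) {
      // same length definition as BM25Similarity with discountOverlaps enabled
      int length = state.getLength() - state.getNumOverlap();
      lastSeenLength.put(state.getName(), length);   // "store it somewhere"
      return delegate.computeNorm(state);
   }

   @Override
   public SimWeight computeWeight(CollectionStatistics collectionStats, TermStatistics... termStats) {
      return delegate.computeWeight(collectionStats, termStats);
   }

   @Override
   public SimScorer simScorer(SimWeight weight, LeafReaderContext context) throws IOException {
      return delegate.simScorer(weight, context);
   }
}

You would register it at indexing time, e.g. via indexWriterConfig.setSimilarity(new LengthRecordingSimilarity(new BM25Similarity())).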

On the other hand, BM25Similarity computes the length this way:

discountOverlaps ? state.getLength() - state.getNumOverlap() : state.getLength();

where getLength() is the number of terms in the field, which you could also compute yourself by iterating over the term vector in a while loop, as you already do in your example.
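
For example, inside the loop from your question, the field length (and the 1/sqrt(length) proxy norm mentioned in the comments) could be derived from the term vector roughly like this, assuming term vectors are stored for COMBINED_FIELD and ignoring the overlap discount:

Terms termVector = reader.getTermVector(docs[i].doc, COMBINED_FIELD);
TermsEnum itr = termVector.iterator();
long fieldLength = 0;
BytesRef term;
while ((term = itr.next()) != null) {
   // for a term vector enum, totalTermFreq() is the term's frequency within this document
   fieldLength += itr.totalTermFreq();
}
// the 1/sqrt(length) proxy norm suggested in the comments above
float approxNorm = (float) (1.0 / Math.sqrt(fieldLength));

Note that termVector.size(), by contrast, reports the number of unique terms in the field, i.e. each term is counted only once, so it is not a substitute for the field length.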