I want to compute TF-IDF scores, normalized by the field norm, for each term in the field COMBINED_FIELD of the documents returned by a Lucene search. As the code below shows, I can get the term frequency of each term in a document's field as well as the document frequency, but I cannot find a way to get the norm of this field at query time. Every approach I have found so far relies on methods that exist only in older Lucene versions, not in Lucene 6. Using a LeafReader might be the way to go, but I have not found a way to obtain an instance of one.
Do you have an idea how I could get the norm of the field COMBINED_FIELD for each document?
Or can I use termVector.size() as a replacement for the field length? Does size() take the number of occurrences of each term into account, or is every term counted only once?
Thanks in advance!
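    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.util.BytesRef;

    IndexSearcher iSearcher = null;
    ScoreDoc[] docs = null;
    try {
        iSearcher = this.searchManager.acquire();
        IndexReader reader = iSearcher.getIndexReader();
        MultiFieldQueryParser parser = new MultiFieldQueryParser(this.getSearchFields(), this.queryAnalyzer);
        parser.setDefaultOperator(QueryParser.Operator.OR);
        Query query = parser.parse(QueryParser.escape(searchString));
        docs = iSearcher.search(query, maxSearchResultNumber).scoreDocs;
        for (int i = 0; i < docs.length; i++) {
            Terms termVector = reader.getTermVector(docs[i].doc, COMBINED_FIELD);
            if (termVector == null) {
                continue; // no term vector stored for this field in this document
            }
            TermsEnum itr = termVector.iterator();
            BytesRef term = null;
            PostingsEnum postings = null;
            while ((term = itr.next()) != null) {
                String termText = term.utf8ToString();
                postings = itr.postings(postings, PostingsEnum.FREQS);
                postings.nextDoc();
                int tf = postings.freq(); // term frequency within this document's field
                int docFreq = reader.docFreq(new Term(COMBINED_FIELD, term)); // document frequency across the index
                // HERE I WANT TO GET THE FIELD LENGTH OF THE CURRENT DOCUMENT
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (iSearcher != null) { // acquire() may have failed before the assignment
            try {
                this.searchManager.release(iSearcher);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }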
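To make the distinction behind my size() question concrete, here is a small plain-Java sketch with no Lucene involved. The class name FieldLengthSketch, its helper methods and the sample tokens are made up for illustration. It only contrasts the number of distinct terms with the total number of term occurrences, the latter being what I mean by "field length".

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldLengthSketch {

    // Count how often each term occurs in a tokenized field (hypothetical helper).
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    // Number of distinct terms: every term is counted once.
    static int distinctTerms(Map<String, Integer> tf) {
        return tf.size();
    }

    // Field length in tokens: the sum of all term frequencies.
    static int totalOccurrences(Map<String, Integer> tf) {
        int sum = 0;
        for (int freq : tf.values()) {
            sum += freq;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up field content, already tokenized.
        List<String> tokens = Arrays.asList("lucene", "index", "lucene", "search", "index", "lucene");
        Map<String, Integer> tf = termFrequencies(tokens);
        System.out.println("distinct terms:    " + distinctTerms(tf));
        System.out.println("total occurrences: " + totalOccurrences(tf));
    }
}
```

In the loop above, I could presumably accumulate the same total by summing postings.freq() over all terms of the vector, but I am not sure that matches the length Lucene used when it computed the norm at index time.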
Alternatively, is there a way to get the TF-IDF or BM25 value for each term of a field directly from Lucene?
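For reference, this is roughly the per-term value I am after, written as a plain-Java sketch without any Lucene calls. The formulas are my own assumption of a classic TF-IDF variant (square-root term frequency, a smoothed logarithmic idf, and a length norm of 1 / sqrt(fieldLength)), not necessarily what Lucene computes internally. The class name NormalizedTfIdfSketch and the parameters numDocs and fieldLength are illustrative.

```java
public class NormalizedTfIdfSketch {

    // Square-root damped term frequency.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Smoothed inverse document frequency (assumed variant).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((numDocs + 1) / (double) (docFreq + 1));
    }

    // Length normalization: longer fields get a smaller weight.
    static double lengthNorm(int fieldLength) {
        return 1.0 / Math.sqrt(fieldLength);
    }

    // Normalized TF-IDF of one term in one document's field.
    static double score(int freq, long docFreq, long numDocs, int fieldLength) {
        return tf(freq) * idf(docFreq, numDocs) * lengthNorm(fieldLength);
    }

    public static void main(String[] args) {
        // Made-up numbers: the term occurs 4 times in a 16-token field
        // and appears in 9 of 99 documents.
        System.out.println(score(4, 9, 99, 16));
    }
}
```

Here freq and docFreq correspond to tf and docFreq in my loop above; fieldLength is the piece I am missing.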