3
votes

I want Lucene's scoring function to have no bias based on the length of the document. This is really a follow-up question to "Calculate the score only based on the documents have more occurance of term in lucene".

How does Field.setOmitNorms(true) work? As far as I can see, two factors make short documents get a high score:

  1. "boost", which favors shorter posts, via doc.getBoost()
  2. "lengthNorm" in the definition of norm(t,d)

Here is the documentation

I was wondering - if I wanted no bias towards shorter documents, is Field.setOmitNorms(true) enough?
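To make the length factor concrete, here is a minimal plain-Java sketch of how I understand it. It assumes the classic TF-IDF length factor of 1/sqrt(numTerms) and that a field with norms omitted is scored as if that factor were 1. The class and method names are made up for illustration and are not Lucene's API.

```java
// Plain-Java model of the classic length factor; not Lucene's own classes.
final class LengthNormSketch {

    // Assumed classic TF-IDF length factor: shrinks as the field gets longer.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Assumption: with norms omitted, no per-document factor is stored,
    // so scoring behaves as if the factor were 1 for every document.
    static double normFactor(int numTerms, boolean omitNorms) {
        return omitNorms ? 1.0 : lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // Norms kept: the 4-term field gets a larger factor than the 100-term field.
        System.out.println(normFactor(4, false));
        System.out.println(normFactor(100, false));
        // Norms omitted: both fields get the same factor.
        System.out.println(normFactor(4, true));
        System.out.println(normFactor(100, true));
    }
}
```

As far as I know, index-time boosts are folded into the same stored norm, so omitting norms would drop those as well.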

2
Look into custom Similarity implementations (derive from DefaultSimilarity and override LengthNorm, Tf, Idf and other functions used for score calculations); it may help you understand the process further. - sisve
We saw the same effect. What worked well for us was Field.setOmitNorms(true) combined with setting the similarity: searcher.setSimilarity(new DefaultSimilarity() { @Override public float tf(float freq) { return 1; } }); This switched off both term counting and document-length normalization. - fricke

2 Answers

1
votes

With BM25Similarity you can reduce either of these parameters to 0f:

@param b Controls to what degree document length normalizes tf values

or

@param k1 Controls non-linear term frequency normalization (saturation).

Both params will affect SimWeight

indexSearcher.setSimilarity(new BM25Similarity(1.2f,0f));
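Why b = 0 removes the length bias can be seen from the textbook BM25 term-frequency formula. The following is a rough plain-Java sketch of that formula, not Lucene's implementation (which stores document lengths in a lossy encoding and may differ in constant factors); the class name and the sample numbers are made up for illustration.

```java
// Textbook BM25 term-frequency component; a sketch, not Lucene's code.
final class Bm25Sketch {

    // freq: occurrences of the term in the document
    // docLen / avgDocLen: document length relative to the collection average
    // k1: term-frequency saturation, b: strength of length normalization
    static double tfNorm(double freq, double docLen, double avgDocLen,
                         double k1, double b) {
        double lengthFactor = 1.0 - b + b * (docLen / avgDocLen);
        return freq * (k1 + 1.0) / (freq + k1 * lengthFactor);
    }

    public static void main(String[] args) {
        double freq = 3, avg = 100;
        // b = 0.75: the short document scores higher than the long one.
        System.out.println(tfNorm(freq, 50, avg, 1.2, 0.75));
        System.out.println(tfNorm(freq, 400, avg, 1.2, 0.75));
        // b = 0: lengthFactor is exactly 1, so both lengths score the same.
        System.out.println(tfNorm(freq, 50, avg, 1.2, 0.0));
        System.out.println(tfNorm(freq, 400, avg, 1.2, 0.0));
    }
}
```

Note that in this formula k1 = 0 also removes the length term, but only because it discards term frequency altogether: every matching document then gets the same value for the term.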

More explanation can be found here : http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/

0
votes

Shorter docs are meant to be more relevant when you use TF-IDF scoring.

You can use your own scoring functions in Lucene. It's easy to customize the scoring algorithm: subclass DefaultSimilarity and override the method you want to customize.
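The override pattern looks roughly like the sketch below. It uses a hypothetical stand-in base class rather than Lucene's DefaultSimilarity, so that it runs on its own; the formulas (tf as the square root of the frequency, length factor as 1/sqrt(numTerms)) are my assumption of the classic defaults, and idf is left out because it is the same for every document matching a given term.

```java
// Hypothetical stand-in for a TF-IDF style similarity; not Lucene's classes.
class TfIdfLikeSimilarity {

    // Assumed classic default: dampened term frequency.
    double tf(double freq) {
        return Math.sqrt(freq);
    }

    // Assumed classic default: longer fields get a smaller factor.
    double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Per-term score without idf, which does not depend on document length.
    final double score(double freq, int numTerms) {
        return tf(freq) * lengthNorm(numTerms);
    }
}

// Subclass that overrides only the length factor, leaving tf untouched.
class LengthNeutralSimilarity extends TfIdfLikeSimilarity {
    @Override
    double lengthNorm(int numTerms) {
        return 1.0; // every document is treated as the same length
    }
}
```

With the subclass, a term occurring 4 times scores the same in a 10-term field and a 1000-term field. In Lucene itself the length factor is computed when documents are indexed, so an override of this kind would have to be in place at index time, and existing documents would need reindexing.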

There's a code sample here that will help you implement it