
We're using Solr to search articles of various lengths. We index both descriptive metadata (title, author, category, keywords, etc.) and the full article text. We do not boost relevance at index time; all boosts are done at query time (we use dismax, coupled with various qf, pf, and bf boosts).

Currently our fulltext field uses the default omitNorms=false, and as a result, all else being equal, shorter articles (2-3 column inches) will frequently have higher relevance than longer, feature-length (multi-page) articles.

In our case article length is a significant indicator of relevance, and so I am considering setting omitNorms=true on our fulltext field.

Questions:

1. Why is the default Lucene/Solr behavior to boost shorter field lengths over longer ones? What is the reasoning?
2. Why would I not want to omitNorms? I don't need to boost queries on this particular field, nor use any kind of faceting on this field.


1 Answer


Question 1:

Boosting shorter fields over longer ones comes from length normalization, which is part of how Lucene applies a fundamental concept of document relevancy called TF-IDF (see http://en.wikipedia.org/wiki/Tf%E2%80%93idf). As a short example, suppose your search returned two documents: the first is 100 words and the second is 1,000 words. Each contains your search keyword just once. Since the keyword makes up 1% of the text in the first document, the short document is judged to be more relevant to your search than the long document, where the keyword is only 0.1% of the text.
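To put rough numbers on that, here is a minimal sketch of the length norm as Lucene's classic TF-IDF similarity defines it, 1/sqrt(number of terms in the field). It ignores the lossy one-byte encoding Lucene uses when it stores norms, so treat the values as approximate; the function name `length_norm` is just mine for illustration.

```python
import math

def length_norm(num_terms: int) -> float:
    """Approximate classic Lucene length norm: 1 / sqrt(number of terms)."""
    return 1.0 / math.sqrt(num_terms)

# The two documents from the example above, each matching the keyword once.
short_norm = length_norm(100)    # 100-word document
long_norm = length_norm(1000)    # 1,000-word document

print(f"short: {short_norm:.4f}  long: {long_norm:.4f}")
print(f"short doc is favored by a factor of {short_norm / long_norm:.2f}")
```

So the penalty grows with the square root of the length rather than linearly: the document that is 10 times longer scores roughly 3 times lower on this factor, not 10 times lower.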

Question 2:

It sounds like, based on your requirements, you might want to try omitting norms. Norms hold the length normalization and any index-time boosts, and since you don't boost at index time, length normalization is the only thing you would lose. However, this may skew your search results in ways you don't expect. It could be that you have been benefiting from some of the nice properties of length normalization and didn't realize it. Another approach might be to store document length in a tag field, labeling documents as "short", "medium", or "long", and then boost documents that match "long", or "long" and "medium", or whatever combination suits you. This would also give your end users the ability to filter on document length when they search.

Again, when I mention the nice properties of length normalization, think of a case like this: one super long article touches on 10 different topics, only one of which matches the user's search, while another long article talks about just one topic, the one that was searched for. In this case, you'd probably prefer the long article over the super long article (even if the super long article matched the search keyword more times). It all depends on your data and your use cases.
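A small sketch of that last scenario, using made-up word counts and match counts. It assumes the classic Lucene formulas, sqrt(term frequency) for the tf part and 1/sqrt(number of terms) for the length norm, and leaves out idf, query-time boosts, and the lossy norm encoding, since those don't change which article comes out ahead here.

```python
import math

def term_score(freq: int, num_terms: int, use_norms: bool) -> float:
    """Simplified per-term score: sqrt(tf), times the length norm if enabled."""
    tf = math.sqrt(freq)
    norm = 1.0 / math.sqrt(num_terms) if use_norms else 1.0
    return tf * norm

# Hypothetical articles: (keyword matches, total words)
focused_long = (8, 2_000)      # long article entirely about the topic
sprawling = (12, 20_000)       # super long article covering 10 topics

for use_norms in (True, False):
    a = term_score(*focused_long, use_norms)
    b = term_score(*sprawling, use_norms)
    winner = "focused long article" if a > b else "super long article"
    print(f"omitNorms={not use_norms}: {a:.4f} vs {b:.4f} -> {winner}")
```

With norms kept, the focused article wins; with norms omitted, the raw match count takes over and the sprawling article ranks first.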