Hibernate Search + Lucene: Fallback-search for stop words

Question

I use Hibernate Search 5.11.5 together with Apache Lucene 5.5.5. In my example I use StopFilterFactory with the default stop word set defined in StopAnalyzer.ENGLISH_STOP_WORDS_SET (e.g. "this", "will", "be", ...).

Now I index three music song titles: "I will survive", "we will rock you", "this will be"

My search query is "Rock will make me survive". It finds "I will survive" and "we will rock you", but not "this will be", because that title consists entirely of stop words. If I search for "this will be", I find nothing.

Now I need a "fallback" search for such songs: if and only if a song title consists entirely of stop words, I would like to find it when all of its words are contained in my search string. So searching for "I will be a fireman" does not find "this will be", but searching for "I will be like this" does.
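To make the rule concrete, here is a minimal plain-Java sketch of the behavior I am after. It does not use Hibernate Search or Lucene at all. The class and method names are made up for illustration, tokenization is a naive whitespace split, and the stop word list is my transcription of StopAnalyzer.ENGLISH_STOP_WORDS_SET, so it should be checked against the Lucene source.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: illustrates the desired fallback rule only.
class StopWordFallback {

    // Assumed to mirror Lucene's default English stop word set.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
            "such", "that", "the", "their", "then", "there", "these", "they",
            "this", "to", "was", "will", "with"));

    // Naive tokenizer: lower-case, then split on whitespace.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // True if every word of the title is a stop word.
    static boolean isStopWordOnly(String title) {
        return STOP_WORDS.containsAll(tokenize(title));
    }

    // Fallback rule: applies only to stop-word-only titles, and requires
    // every word of the title to occur somewhere in the query.
    static boolean fallbackMatches(String title, String query) {
        Set<String> queryWords = new HashSet<>(tokenize(query));
        return isStopWordOnly(title) && queryWords.containsAll(tokenize(title));
    }

    public static void main(String[] args) {
        System.out.println(fallbackMatches("this will be", "I will be like this"));
        System.out.println(fallbackMatches("this will be", "I will be a fireman"));
    }
}
```

With this rule, "I will survive" is never a fallback candidate, because "I" and "survive" are not in the stop word set.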

Do you know how I can achieve this?

yrodiere · Accepted Answer · 2021-02-22T10:30:21

Personally, in such a situation I would consider simply doing away with StopFilterFactory.

The main problem with stop words is that they appear very frequently in many documents, and thus they affect the relevance (score) in a way that's completely out of proportion considering they don't have much meaning.

So we generally don't index them at all, to work around the problem. As a bonus, this may reduce the index size to some extent.

But there's another solution, which is to keep stop words and fix how scores are computed. In Lucene, the component responsible for computing scores is called the Similarity. The default one in Hibernate Search 5 / Lucene 5.5 is ClassicSimilarity, which suffers from this problem with stop words. Another more recent implementation is BM25, and that implementation has much better behavior when it comes to stop words: it does not let them affect the score as much. You can find an in-depth explanation here, if you are interested. Note that BM25 replaced ClassicSimilarity as the default similarity in more recent versions of Lucene and Hibernate Search, as well as in Elasticsearch.
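To give a feel for the difference, here is a small standalone sketch of how the term-frequency part of the score grows in each model. It is a simplification, not Lucene's actual code: it ignores IDF, query norms and Lucene's lossy norm encoding, and the class and method names are made up. The formulas are the textbook ones (square root of the frequency for the classic similarity, the saturating BM25 term with the usual defaults k1 = 1.2 and b = 0.75).

```java
// Hypothetical illustration of term-frequency growth; not Lucene code.
class TfSaturation {

    // Classic TF-IDF style: grows without bound as the term repeats.
    static double classicTf(double freq) {
        return Math.sqrt(freq);
    }

    // BM25 style: approaches (k1 + 1) as freq grows, so repeated
    // occurrences of a very common word add less and less.
    static double bm25Tf(double freq, double k1, double b,
                         double docLen, double avgDocLen) {
        double lengthNorm = k1 * (1 - b + b * docLen / avgDocLen);
        return freq * (k1 + 1) / (freq + lengthNorm);
    }

    public static void main(String[] args) {
        double k1 = 1.2, b = 0.75;
        for (int freq : new int[] {1, 4, 16, 64}) {
            // Document length equal to the average, to isolate the tf effect.
            System.out.printf("freq=%d classic=%.3f bm25=%.3f%n",
                    freq, classicTf(freq), bm25Tf(freq, k1, b, 10, 10));
        }
    }
}
```

The classic value keeps climbing with the frequency, while the BM25 value never exceeds k1 + 1 = 2.2, which is one reason frequent words distort the ranking less.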

I'd suggest you change the Similarity to use org.apache.lucene.search.similarities.BM25Similarity, remove your stop-word filters, then reindex your data, then test your queries again. Are you getting relevant hits near the top? Is the index size still manageable? Is your query "this will be" matching something? If so, switching to BM25 is a completely viable solution.
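As a rough sketch of the configuration change, assuming the similarity is set through the hibernate.search.default.similarity property (this key is from my memory of the Hibernate Search 5 reference, so please verify it for 5.11), the setting could be collected like this before being passed to whatever bootstraps your persistence unit. The class and method names are hypothetical.

```java
import java.util.Properties;

// Hypothetical holder for the extra Hibernate Search settings.
class SearchSettings {

    // Assumption: property key as documented for Hibernate Search 5.x.
    static final String SIMILARITY_KEY = "hibernate.search.default.similarity";

    static Properties bm25Settings() {
        Properties props = new Properties();
        // Fully-qualified class name of the Lucene similarity to use.
        props.setProperty(SIMILARITY_KEY,
                "org.apache.lucene.search.similarities.BM25Similarity");
        return props;
    }

    public static void main(String[] args) {
        // Print the settings so they can be copied into a properties file.
        bm25Settings().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The same key/value pair can equally go into hibernate.properties or persistence.xml. Either way, the existing index has to be rebuilt after the analyzer change, since the stop words were never written to it.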

Note that you can also consider upgrading to Hibernate Search 6 which uses BM25 by default.
