Solr: stemming breaks highlighting

Question

I changed some of my fields from text_general to text_en, hoping to take advantage of stemming and some other improvements, but unfortunately the change has broken highlighting. It seems that it only wants to highlight non-stemmed words (i.e. words whose stemmed version is the same as the word itself, like "child").

I'm using the default fieldType definition:

 <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StopFilterFactory"
             ignoreCase="true"
             words="lang/stopwords_en.txt"
             />
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.EnglishPossessiveFilterFactory"/>
     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
     <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory"
             ignoreCase="true"
             words="lang/stopwords_en.txt"
             />
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.EnglishPossessiveFilterFactory"/>
     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
     <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
 </fieldType>

I'm enabling highlighting with hl.fl=title&hl=true in my query. This is also a faceted search, if that matters.

In this case, as I said, only unstemmed words like "child" are highlighted. If I remove the stemming filter from the index analyzer only (the query analyzer seems to have no effect) in the text_en definition, all matched words except stopwords are highlighted. Furthermore, if I change text_en to use the EnglishMinimalStemFilterFactory, more words are highlighted, which I assume is because the Porter stemmer stems them but this one does not. An example of such a word is "strides".

Does anyone know what's going on?

Tom O'Malley · Accepted Answer · 2017-05-02T19:10:57

I know this question is dead, but for anyone reading this, here is my solution.

First, note that this behavior only happens if you are using "hl.q". If you are using "q" as the highlighting query as well as the search query, things should be fine. But if that's not what you need for your application, you can do this:

In your analyzer chain, for both indexing and query, add this:

 <filter class="solr.KeywordRepeatFilterFactory"/>
 <filter class="solr.SnowballPorterFilterFactory"/>
 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

Basically, this keeps both the stemmed and the unstemmed form of every word, then removes the duplicate where the two forms are identical. Highlighting will now match the documents. Note that it will still only match the EXACT phrase entered as hl.q, while q will match anything with the same stem, so some documents returned by q may still have a blank highlight field.
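
For reference, here is a rough sketch of how those three filters might slot into the text_en definition from the question. The field type name "text_en_hl" is made up for illustration, and I'm assuming the Snowball stemmer replaces the existing PorterStemFilterFactory rather than running alongside it (language="English" is spelled out here rather than left implicit). Treat it as a starting point, not a tested configuration.

```xml
<!-- Sketch only: "text_en_hl" is a hypothetical name; the rest mirrors text_en from the question -->
<fieldType name="text_en_hl" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Emit every token twice: one copy flagged as a keyword, one left open to stemming -->
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <!-- Assumption: this stands in for the PorterStemFilterFactory line; it skips keyword-flagged tokens -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <!-- Drop the second copy when stemming left the word unchanged (e.g. "child") -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Same three filters as in the index analyzer -->
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Since the index analyzer changes, existing documents need to be reindexed before the unstemmed tokens show up in the index.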
