Solr community! I've spent a lot of time on this problem and really need some guidance.
Problem: A phrase query with slop matches the document, but the highlighter does not highlight the matching snippet, even though it's in the document.
We have a large database of legal documents. When our users search, they want to search within paragraphs to get more relevant results. They don't care about documents where the query words are scattered across the document. We achieved this using a PhraseQuery with slop on a multiValued field that has a large position gap between values (positionIncrementGap=5000). Each value is a separate paragraph of the text.
Here is our field configuration:
<fieldType name="text_split_by_paragraphs" class="solr.TextField" positionIncrementGap="5000" autoGeneratePhraseQueries="true">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.HunspellStemFilterFactory" dictionary="uk_UA.dic" affix="uk_UA.aff"/>
  </analyzer>
  <similarity class="solr.BM25SimilarityFactory">
    <str name="k1">0.4</str>
    <str name="b">0.2</str>
  </similarity>
</fieldType>
<field name="f_text_paragraphs" type="text_split_by_paragraphs" indexed="true" stored="true" multiValued="true" required="false" omitNorms="false" termVectors="true" termPositions="true" termOffsets="true" />
Let's say we have this very simple data stored in the f_text_paragraphs field. The position gap between values is 5000, as mentioned above:
- "Apples are very delicious",
- "Apples make me happy",
Let's run some pseudo-queries.
- Query: f_text_paragraphs:"Apple delicious"~5000 - OK
- Query: f_text_paragraphs:"delicious apple"~5000 - OK
- Query: f_text_paragraphs:"delicious happy"~5000 - No match, which is correct because the words are in different values of the multiValued field.
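To make the position arithmetic behind these results concrete, here is a minimal Python sketch. It is a simplified model, not Lucene's actual implementation: it assumes plain whitespace tokenization with lowercasing, no stop-word removal or stemming, and that each value after the first is offset by the gap before its tokens are numbered. The helper names `assign_positions` and `min_distance` are hypothetical.

```python
def assign_positions(values, gap):
    """Assign a token position to every term across all values of a field.

    Assumption: positions continue across values, and `gap` is added
    before the tokens of every value except the first.
    Returns a list of (term, position, value_index) tuples.
    """
    postings = []
    position = -1
    for value_index, value in enumerate(values):
        if value_index > 0:
            position += gap
        for term in value.lower().split():
            position += 1
            postings.append((term, position, value_index))
    return postings


def min_distance(postings, term_a, term_b):
    """Smallest absolute position difference between two terms.

    Raises ValueError if either term does not occur.
    """
    positions_a = [pos for term, pos, _ in postings if term == term_a]
    positions_b = [pos for term, pos, _ in postings if term == term_b]
    return min(abs(a - b) for a in positions_a for b in positions_b)


if __name__ == "__main__":
    paragraphs = ["Apples are very delicious", "Apples make me happy"]
    postings = assign_positions(paragraphs, gap=5000)
    print(min_distance(postings, "apples", "delicious"))  # 3
    print(min_distance(postings, "delicious", "happy"))   # 5004
```

In this model "apples" and "delicious" are 3 positions apart inside the first value, while the nearest "delicious" and "happy" are 5004 positions apart, which is consistent with a slop of 5000 not bridging them.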
Querying works as expected. But highlighting doesn't behave the same way. Here are the highlight settings:
hl=true&
hl.useFastVectorHighlighter=true&
hl.usePhraseHighlighter=false&
hl.highlightMultiTerm=true&
hl.fragmentsBuilder=colored& // custom builder
hl.boundaryScanner=breakIterator&
f.f_text_paragraphs.hl.fragsize=400&
f.f_text_paragraphs.hl.maxAnalyzedChars=2147483647&
f.f_text_paragraphs.hl.snippets=5&
hl.fl=f_text_paragraphs&
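For anyone who wants to reproduce this, the parameters above can be assembled into a query string roughly as follows. This is only a sketch: the function name `build_highlight_params` is hypothetical, no host or core name is included, and `colored` refers to the fragments builder configured in our setup.

```python
from urllib.parse import urlencode


def build_highlight_params(query, phrase_highlighter=False):
    """Build the URL-encoded query string for a highlighting request.

    `phrase_highlighter` toggles hl.usePhraseHighlighter so both
    variants described in this post can be compared.
    """
    params = [
        ("q", query),
        ("hl", "true"),
        ("hl.useFastVectorHighlighter", "true"),
        ("hl.usePhraseHighlighter", "true" if phrase_highlighter else "false"),
        ("hl.highlightMultiTerm", "true"),
        ("hl.fragmentsBuilder", "colored"),  # builder from our configuration
        ("hl.boundaryScanner", "breakIterator"),
        ("f.f_text_paragraphs.hl.fragsize", "400"),
        ("f.f_text_paragraphs.hl.maxAnalyzedChars", "2147483647"),
        ("f.f_text_paragraphs.hl.snippets", "5"),
        ("hl.fl", "f_text_paragraphs"),
    ]
    return urlencode(params)


if __name__ == "__main__":
    print(build_highlight_params('f_text_paragraphs:"delicious apple"~5000'))
```

Passing `phrase_highlighter=True` produces the request for the first of the two cases below.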
With hl.usePhraseHighlighter=true:
- Query: f_text_paragraphs:"Apple delicious"~5000 - OK, returns the snippet.
- Query: f_text_paragraphs:"delicious apple"~5000 - FAIL, even though the same query matched in search. It looks like Solr preserves word order when usePhraseHighlighter is set to true, which is not what we need.
With hl.usePhraseHighlighter=false:
- Query: f_text_paragraphs:"Apple delicious"~5000 - OK, but it also returns the second value, "Apples make me happy". Solr apparently ignored the slop and matched that value only because it contains the word "Apples".
- Query: f_text_paragraphs:"delicious apple"~5000 - OK, but with the same problem as above.
Question: How can we make the highlighter respect the slop and ignore word order? It should return only snippets where ALL query words are in the same paragraph (slop=5000). We expect the same logic as in the search query.
parsedquery_toString: " boost(+(+((f_text_paragraphs:"доданий загальний вартість"~5000) ((+id:ДОДАНУ ЗАГАЛЬНУ ВАРТІСТЬ)))~0.5)
highlight:QueryToHighlight: [ [ "org.apache.lucene.search.PhraseQuery:f_text_paragraphs:"доданий загальний вартість"~5000" ],
– CliveLewis