Solr community! I've spent so much time on this problem...I really need some guidance.

Problem: A phrase query with slop matches the document, but the highlighter returns no snippet for it, even though the matching text is in the document.

We have a large database of legal documents. When our users search, they want matches inside a single paragraph, because that gives more relevant results. They don't care about documents where the query words are scattered throughout the text. We achieved this using a PhraseQuery with slop on a multivalued field with a large position gap between values (positionIncrementGap=5000). Each value is a separate paragraph of the text.

Here is our field configuration:

    <fieldType name="text_split_by_paragraphs" class="solr.TextField" positionIncrementGap="5000" autoGeneratePhraseQueries="true">
        <analyzer>
            <charFilter class="solr.HTMLStripCharFilterFactory"/>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.HunspellStemFilterFactory" dictionary="uk_UA.dic" affix="uk_UA.aff"/>
        </analyzer>
        <similarity class="solr.BM25SimilarityFactory">
            <str name="k1">0.4</str>
            <str name="b">0.2</str>
        </similarity>
    </fieldType>

    <field name="f_text_paragraphs" type="text_split_by_paragraphs" indexed="true" stored="true" multiValued="true" required="false" omitNorms="false" termVectors="true" termPositions="true" termOffsets="true"/>

Let's say we have this very simple data stored in the f_text_paragraphs field. The position gap between values is 5000, as mentioned above:

  • "Apples are very delicious",
  • "Apples make me happy",

Let's run some pseudo-queries.

  • Query: f_text_paragraphs:"Apple delicious"~5000 - OK
  • Query: f_text_paragraphs:"delicious apple"~5000 - OK
  • Query: f_text_paragraphs:"delicious happy"~5000 - No match, which is correct because the words are in different values of the multiValued field.
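
To make the position arithmetic behind these three results concrete, here is a rough Python model. It is only a sketch under stated assumptions, not Lucene's actual matcher: the token lists are a hypothetical analyzed form of the two values (lowercased and stemmed by hand), the first token of each later value is assumed to sit at the previous value's last position + gap + 1, and the slop a phrase needs is taken to be the spread of the document positions after shifting each by the term's offset in the query. The names `index_positions` and `required_slop` are made up for this illustration.

```python
from itertools import product

GAP = 5000  # positionIncrementGap from the field type above


def index_positions(values, gap=GAP):
    """Assign a position to every token; `values` are pre-analyzed token lists."""
    positions = {}
    pos = -1
    for tokens in values:
        for token in tokens:
            pos += 1
            positions.setdefault(token, []).append(pos)
        pos += gap  # assumed: the gap is added once after each value
    return positions


def required_slop(phrase, positions):
    """Smallest slop at which the phrase matches in this model, or None."""
    if any(term not in positions for term in phrase):
        return None
    best = None
    for combo in product(*(positions[term] for term in phrase)):
        # Shift each document position by the term's offset in the query;
        # the spread of the shifted positions is the slop this alignment needs.
        shifted = [doc_pos - offset for offset, doc_pos in enumerate(combo)]
        spread = max(shifted) - min(shifted)
        best = spread if best is None else min(best, spread)
    return best


# Hypothetical analyzed tokens for the two paragraphs above.
paragraphs = [
    ["apple", "are", "very", "delicious"],
    ["apple", "make", "me", "happy"],
]
positions = index_positions(paragraphs)

for phrase in (["apple", "delicious"], ["delicious", "apple"], ["delicious", "happy"]):
    needed = required_slop(phrase, positions)
    print(phrase, "needs slop", needed, "-> matches at ~5000:", needed <= 5000)
```

In this model the reversed phrase only costs a little extra slop inside one paragraph, while "delicious happy" needs more than 5000 because the two words sit on opposite sides of the gap. That is the behaviour we would like the highlighter to reproduce.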

Query functionality works as expected, perfect. But the highlighting doesn't work the same way. Here are the highlight settings:

    hl=true&
    hl.useFastVectorHighlighter=true&
    hl.usePhraseHighlighter=false&
    hl.highlightMultiTerm=true&
    hl.fragmentsBuilder=colored& // custom builder
    hl.boundaryScanner=breakIterator&
    f.f_text_paragraphs.hl.fragsize=400&
    f.f_text_paragraphs.hl.maxAnalyzedChars=2147483647&
    f.f_text_paragraphs.hl.snippets=5&
    hl.fl=f_text_paragraphs&

With hl.usePhraseHighlighter=true:

  • Query: f_text_paragraphs:"Apple delicious"~5000 - OK, returns the snippet.
  • Query: f_text_paragraphs:"delicious apple"~5000 - FAIL, no snippet, even though the search matched with the same query. It looks like when usePhraseHighlighter is set to true, Solr requires the words to appear in query order, which is not what we need.

With hl.usePhraseHighlighter=false:

  • Query: f_text_paragraphs:"Apple delicious"~5000 - OK, but it also returns the second value, "Apples make me happy". For some reason Solr ignores the slop and highlights the second value only because it contains the word "Apples".
  • Query: f_text_paragraphs:"delicious apple"~5000 - OK, but with the same problem as above.

Question: How can we make the highlighter respect the slop and ignore word order? It should return only snippets where ALL the query words are in the same paragraph (slop=5000), using the same logic as the search query.

Can you add &debugQuery=true (or &debug=true) and print the parse tree of your query? - D_K
@D_K Sure. As a side note, we use a custom query parser, so I always know what queries are generated. Here are the search query and the highlight query (I removed every field except the one related to this issue): parsedquery_toString: " boost(+(+((f_text_paragraphs:"доданий загальний вартість"~5000) ((+id:ДОДАНУ ЗАГАЛЬНУ ВАРТІСТЬ)))~0.5) highlight: QueryToHighlight: [ [ "org.apache.lucene.search.PhraseQuery:f_text_paragraphs:"доданий загальний вартість"~5000" ], - CliveLewis
@D_K The search query matches the document, but the snippet is empty. If I replace the query with "загальний доданий вартість", it shows me the snippet (because the words appear in that order in the text). - CliveLewis
Not sure if this will help, but two things: 1. Do you use solr.apache.org/guide/8_9/… ? Which is language-specific. 2. Can you make sure the term vectors have been indexed correctly, using Luke? - D_K
Actually, I think I might know what is happening. Since you have a position gap of 5000, the highlighter might not work correctly. I've done custom query parsing myself and had to implement adjustments to the highlighter as well. In any case, you can try varying hl.phraseLimit: "The maximum number of phrases to analyze when searching for the highest-scoring phrase. The default is 5000." - D_K

1 Answer

Lucene/Solr's "Unified Highlighter" is generally the best one. It is the most accurate highlighter and was in fact designed for the same search domain you are working in (legal search), where the accuracy of highlights is very important. Use hl.method=unified to switch to it. Be sure to check out the documentation for details, switching to the docs for your Solr version as appropriate.
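
As a rough sketch of what that switch could look like from a client, the snippet below builds the request parameters in Python. The per-field settings are copied from the question, the helper name `unified_highlight_params` is made up for illustration, and the FastVectorHighlighter-specific options (fragmentsBuilder, boundaryScanner) are simply left out. Whether this resolves the word-order issue on your data is something to verify yourself.

```python
from urllib.parse import parse_qs, urlencode


def unified_highlight_params(query, field="f_text_paragraphs"):
    """Build a Solr query string that requests the Unified Highlighter."""
    params = {
        "q": query,
        "hl": "true",
        "hl.method": "unified",  # switch highlighter implementation
        "hl.fl": field,
        "hl.usePhraseHighlighter": "true",  # keep phrase/slop-aware highlighting on
        f"f.{field}.hl.snippets": "5",  # value taken from the question
        f"f.{field}.hl.fragsize": "400",  # value taken from the question
    }
    return urlencode(params)


query_string = unified_highlight_params('f_text_paragraphs:"delicious apple"~5000')
print(query_string)
# Round-trip check that the highlighter choice survived URL encoding.
print(parse_qs(query_string)["hl.method"])
```

Append the resulting string to your collection's select handler URL and compare the snippets for the in-order and reversed phrases.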

BTW, the Solr community is best found on its users mailing list and Slack, not so much on Stack Overflow.