Solr 8.4.1 search: ignore duplicate phrase

Question

We are using Solr 8.4.1 to search for products in our documents. I want exact phrase matches to come out on top, but if the same phrase is repeated many times in a document, it should be counted only once. Right now, documents that contain the same phrase multiple times come out on top because they get a higher score.

Please see the results below for a search on "pipes". Seven results are found, but prod_id:297720, named "my pipes pipes", has a different score from prod_id:3064.

For our needs, both should have the same score: repeated occurrences of the phrase within a document should be ignored.

We are using the similarity class <similarity class="solr.BM25SimilarityFactory"/>

The schema definition of the searched field is:

<field name="product_related_kword" type="text_general" indexed="true" stored="true" />

And the result is:

{
  "responseHeader":{
    "status":0,
    "QTime":222,
    "params":{
      "q":"product_related_kword:pipes",
      "fl":"product_name,score,product_related_kword,prod_id,member_classified_slno,member_id",
      "start":"0",
      "rows":"100",
      "debugQuery":"on"}},
  "response":{"numFound":7,"start":0,"maxScore":2.7593598,"docs":[
      {
        "prod_id":297720,
        "product_name":"my pipes pipes",
        "member_classified_slno":123457327,
        "member_id":"11111327",
        "product_related_kword":"my pipes pipes 00",
        "score":2.7593598},
      {
        "prod_id":3064,
        "product_name":"pipes",
        "member_classified_slno":123457560,
        "member_id":"11119579",
        "product_related_kword":"pipes 00",
        "score":2.5436506},
      {
        "prod_id":3064,
        "product_name":"pipes",
        "member_classified_slno":123457544,
        "member_id":"11113186",
        "product_related_kword":"pipes 00",
        "score":2.5436506},
      {
        "prod_id":3064,
        "product_name":"pipes",
        "member_classified_slno":123457546,
        "member_id":"11113636",
        "product_related_kword":"pipes 00",
        "score":2.5436506},
      {
        "prod_id":3064,
        "product_name":"pipes",
        "member_classified_slno":123457551,
        "member_id":"11119238",
        "product_related_kword":"pipes 00",
        "score":2.5436506},
      {
        "prod_id":3064,
        "product_name":"pipes",
        "member_classified_slno":123457553,
        "member_id":"785565531",
        "product_related_kword":"pipes 00",
        "score":2.5436506
       }
    ]
  },

I'm not sure how it behaves for phrase matches, but could you try using the constant score operator? Another option is to use a custom similarity based on BM25 but with tf always returning 1. - MatsLindh
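
For reference, the constant score operator mentioned in the comment above is written ^= in Solr's standard query syntax, so the query would look like product_related_kword:pipes^=1. As for making tf always 1, a schema-only approximation (a different technique from the custom similarity the comment proposes) is to stop indexing term frequencies for the field. The sketch below reuses the field from the question and assumes that phrase queries on this field are not needed, because omitTermFreqAndPositions also drops positions. It adds omitNorms on the assumption that BM25 length normalization would otherwise still score "my pipes pipes 00" and "pipes 00" differently.

```xml
<!-- Sketch only: the question's field with term frequencies and length norms switched off. -->
<!-- omitTermFreqAndPositions="true" makes every match count as a single occurrence, -->
<!-- but it also removes positions, so phrase queries against this field stop working. -->
<!-- omitNorms="true" disables field-length normalization. -->
<field name="product_related_kword" type="text_general" indexed="true" stored="true"
       omitTermFreqAndPositions="true" omitNorms="true" />
```

A full reindex is needed after changing these attributes. Running the query again with debugQuery=on should then show whether both documents receive the same score.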

raghu777 · Accepted Answer · 2020-06-03T06:23:59

Add the following filter to your index analyzer.

<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

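This removes duplicate tokens from the stream. Note that the filter treats tokens as duplicates only when they have the same text and the same position, which is typically what synonym expansion or stemming produces. If you use synonyms, add this filter after the Synonym Filter for better results.

See RemoveDuplicatesTokenFilterFactory on the following page: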

https://lucene.apache.org/solr/guide/8_4/filter-descriptions.html
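
As a rough illustration of where the filter goes, here is a sketch of a field type with the filter appended to the index analyzer. The question does not show its text_general definition, so the tokenizer, the other filters, and the stopwords.txt and synonyms.txt file names are assumptions modeled on the stock text_general type. Only the RemoveDuplicatesTokenFilterFactory line comes from the answer.

```xml
<!-- Sketch only: assumes a text_general type similar to the default configset. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Added as the last step of the index chain, after any synonym filter. -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Changes to the index analyzer only affect documents indexed afterwards, so a reindex is required. Before relying on the change, it is worth running a value such as "my pipes pipes 00" through the Analysis screen of the Solr Admin UI to confirm that the resulting token stream is what you expect.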
