I have indexed my documents' text using the following config in Solr:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory" />
            <filter class="solr.ASCIIFoldingFilterFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <!-- in this example, we will only use synonyms at query time <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/> -->
            <filter class="solr.LowerCaseFilterFactory" />              
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory" />
            <filter class="solr.ASCIIFoldingFilterFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
            <filter class="solr.LowerCaseFilterFactory" />
        </analyzer>
</fieldType>

<field name="desc" type="text_general" indexed="true" stored="true" multiValued="false"/>

And a test query:

desc:Alabama Crimson Tide Toddler Crimson Team Logo Flannel Pajama Pants

It returns results of which the first two look like this:

  {
    "id":"_:node1b897e5ffccc354e5da5128066e2e9e4|https://www.crookscountry.com/product/alabama-greatest-hits",
    "name":"Alabama - Greatest Hits",
    "source_entity_index":"prod03",
    "category":"",
    "category_str":"",
    "desc":"Alabama ~ Alabama - Greatest Hits",
    "host":"www.crookscountry.com",
    "url":"https://www.crookscountry.com/product/alabama-greatest-hits",
    "_version_":1652845859059007489},
  {
    "id":"_:noded8c4ca8e98bb12e1132af18c76f277b|https://shop.spreadshirt.com/thatshirtcray/amateur+sketch+shirt-A12174934",
    "name":"Amateur Sketch Shirt | Men's T-Shirt",
    "source_entity_index":"prod03",
    "category":"",
    "category_str":"",
    "desc":"Leprechaun in Alabama amateur sketch.",
    "host":"shop.spreadshirt.com",
    "url":"https://shop.spreadshirt.com/thatshirtcray/amateur+sketch+shirt-A12174934",
    "_version_":1652846254331265025},

But the documents I really want to rank highly appear below the top 100, e.g.:

{
        "id":"_:nodec65a89504cb5f3af808caf654ac7cb72|http://shop.rolltide.com/Alabama_Crimson_Tide_Sweatshirts_And_Fleece_Sweaters",
        "host":"shop.rolltide.com",
        "name":"Men's Crimson Alabama Crimson Tide Big Logo Sweater",
        "text":"Show off your team spirit with this Alabama Crimson Tide Big Logo sweater.",
        "_version_":1646377538225700866},
      {
        "id":"_:nodeebc0adb5a11937556ebdf77132fab580|http://shop.foxsports.com/FOX_Alabama_Crimson_Tide_Sweaters_And_Dress_Shirts",
        "host":"shop.foxsports.com",
        "name":"Men's Crimson Alabama Crimson Tide Big Logo Sweater",
        "text":"Show off your team spirit with this Alabama Crimson Tide Big Logo sweater.",
        "_version_":1646383652576165892},

I do not quite understand how the default Solr ranking works. It seems that it favours short text, even if there is only one overlapping word with the query. Is there any way I can change this to suit my needs?

Much appreciated!

Be aware that desc:Alabama Crimson Tide Toddler Crimson Team Logo Flannel Pajama Pants searches for Alabama in desc, but the rest of the terms are searched in the default search field. Since the two documents you want ranked higher don't even have a desc field, it's hard to say exactly why the scores are what they are. Append debug=all to your query to see how each document is scored (i.e. which terms contribute what to the total score). Using the edismax handler (defType=edismax) with qf and explicit field weights usually gives a better result. - MatsLindh
Thank you, I am trying Eric's idea, but this is new to me... well, perhaps I should have picked this up when learning Solr... - Ziqi

1 Answer

Solr's document ranking relies on Lucene's Similarity implementation (BM25 by default since Solr 6).

"it seems that it favours short text, even if there is only one overlapping word with the query"

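This behavior is due to field length normalization: a term match in a short field counts for more than the same match in a long field. You can set omitNorms="true" on the field (or its field type) to disable it (cf. https://lucene.apache.org/solr/guide/6_6/field-type-definitions-and-properties.html#field-default-properties). The documents must be reindexed for the change to take effect.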
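
As a minimal sketch, assuming the desc field from the question is the one whose length should stop influencing the score, the field definition in the schema could be changed like this. The only new attribute is omitNorms; everything else is copied from the question.

```xml
<!-- Sketch only: the question's desc field with length normalization switched off. -->
<!-- omitNorms="true" drops the per-field norms, so short and long values are no longer scored differently on length. -->
<!-- Existing documents need to be reindexed before this has any effect. -->
<field name="desc" type="text_general" indexed="true" stored="true" multiValued="false" omitNorms="true"/>
```

The same attribute can be placed on the text_general fieldType instead if every field of that type should behave this way.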

See this post for a more in-depth explanation.

Alternatively or additionally, with the (e)dismax parser you can tune the mm (MinimumShouldMatch) parameter. It changes not the ranking but which documents Solr matches in the first place, for example by requiring that a minimum share of the query terms be present.
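
The edismax and mm suggestions could be combined in solrconfig.xml roughly as follows. This is a sketch under assumptions: the handler name /select, the field weights in qf, and the mm value of 75% are illustrative choices, not values taken from the question. The field names name, desc and text are the ones visible in the example documents.

```xml
<!-- Sketch only: request handler defaults that switch the query parser to edismax. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Use the extended dismax parser instead of the standard Lucene parser. -->
    <str name="defType">edismax</str>
    <!-- Search every query term in all three fields; the ^ boosts are assumed weights to tune. -->
    <str name="qf">name^2 desc text</str>
    <!-- Require about three quarters of the optional query terms to match (assumed value). -->
    <str name="mm">75%</str>
  </lst>
</requestHandler>
```

With defaults like these, the query can be sent without the desc: prefix (q=Alabama Crimson Tide Toddler Crimson Team Logo Flannel Pajama Pants), so that no term falls back to the default search field. The same parameters can also be passed per request (defType, qf, mm) to experiment before changing the config, and debug=all shows how each field and term contributes to the score.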