Elasticsearch Edge NGram tokenizer higher score when word begins with n-gram

Question

Suppose there is the following mapping with Edge NGram Tokenizer:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_analyzer": {
          "tokenizer": "autocomplete_tokenizer",
          "filter": [
            "standard"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "whitespace"
        }
      },
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "symbol"
          ]
        }
      }
    }
  },
  "mappings": {
    "tag": {
      "properties": {
        "id": {
          "type": "long"
        },
        "name": {
          "type": "text",
          "analyzer": "autocomplete_analyzer",
          "search_analyzer": "autocomplete_search"
        }
      }
    }
  }
}

And the following documents are indexed:

POST /tag/tag/_bulk
{"index":{}}
{"name" : "HITS FIND SOME"}
{"index":{}}
{"name" : "TRENDING HI"}
{"index":{}}
{"name" : "HITS OTHER"}

Then searching

{
  "query": {
    "match": {
      "name": {
        "query": "HI"
      }
    }
  }
}

yields all three documents with the same score, or TRENDING HI with a score higher than one of the others.

How can it be configured so that entries that actually start with the searched n-gram get a higher score? In this case, HITS FIND SOME and HITS OTHER should score higher than TRENDING HI, while TRENDING HI should still appear in the results.

A highlighter is also used, so the solution shouldn't break it.

The highlighter used in the query is:

"highlight": {
  "pre_tags": [
    "<"
  ],
  "post_tags": [
    ">"
  ],
  "fields": {
    "name": {}
  }
}

Using this with match_phrase_prefix messes up the highlighting, yielding <H><I><T><S> FIND SOME when searching only for H.

Thomas Decaux · Accepted Answer · 2018-11-12T16:54:39

You must understand how Elasticsearch/Lucene analyzes your data and calculates the search score.

1. Analyze API

The Analyze API (https://www.elastic.co/guide/en/elasticsearch/reference/current/_testing_analyzers.html) shows you which tokens Elasticsearch will store. In your case, for TRENDING HI:

T / TR / TRE / ... / TRENDING / H / HI
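
To check this yourself, a request body along the following lines could be sent to the index's _analyze endpoint (for example POST /tag/_analyze, assuming the index is named tag as in the question). The analyzer name comes from the question's settings and the text is one of the sample documents.

```json
{
  "analyzer": "autocomplete_analyzer",
  "text": "TRENDING HI"
}
```

The response lists each emitted token together with its position and character offsets, which is also what the highlighter works from.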

2. Score

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html

The bool query is often used to build complex queries for a particular use case. Use must to filter documents, then should to score them. A common pattern is to apply different analyzers to the same field (with the "fields" keyword in the mapping, you can analyze the same field in several ways).
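
A minimal sketch of such a multi-field, assuming a hypothetical sub-field called raw (it is not part of the original mapping): the name property keeps its autocomplete analyzers and gains an un-analyzed keyword copy. The fragment would stand in for the "name" entry under "properties".

```json
{
  "name": {
    "type": "text",
    "analyzer": "autocomplete_analyzer",
    "search_analyzer": "autocomplete_search",
    "fields": {
      "raw": {
        "type": "keyword"
      }
    }
  }
}
```

The whole value, e.g. HITS FIND SOME, is then indexed as a single term under name.raw, so a query on that sub-field can tell how the full string begins. Documents indexed before the change would need to be reindexed for the sub-field to be populated.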

3. Don't mess up the highlighting

According to the docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#specify-highlight-query

You can add an extra query:

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "HI"
          }
        }
      ],
      "should": [
        {
          "prefix": {
            "name": "HI"
          }
        }
      ]
    }
  },
  "highlight": {
    "pre_tags": [
      "<"
    ],
    "post_tags": [
      ">"
    ],
    "fields": {
      "name": {
        "highlight_query": {
          "match": {
            "name": "HI"
          }
        }
      }
    }
  }
}
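
One thing to verify with this query: prefix is a term-level query, so against name it is compared with the individual edge n-gram terms, and a document like TRENDING HI (which contains the term HI) would satisfy the should clause as well. A possible variant, assuming a hypothetical un-analyzed keyword sub-field name.raw has been added to the mapping via multi-fields, points the prefix clause at that sub-field so that only values whose full string starts with HI receive the extra score. The highlight section is unchanged.

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "HI" } }
      ],
      "should": [
        { "prefix": { "name.raw": "HI" } }
      ]
    }
  },
  "highlight": {
    "pre_tags": ["<"],
    "post_tags": [">"],
    "fields": {
      "name": {
        "highlight_query": {
          "match": { "name": "HI" }
        }
      }
    }
  }
}
```

Keyword terms are case-sensitive unless a normalizer is configured, so the prefix has to match the casing of the stored value.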
