The Problem

I am working on an autocompleter using Elasticsearch 6.2.3. I would like my query results (a list of pages with a Name field) to be ordered by the following priority:

  1. Prefix match at start of "Name" (Prefix query)
  2. Any other exact (whole word) match within "Name" (Term query)
  3. Fuzzy match (currently done on a different field from Name using an ngram tokenizer, so I assume it is not relevant to my problem, but I would like to apply it to the Name field as well)

My Attempted Solution

I will be using a Bool/Should query consisting of three queries (corresponding to the three priorities above), using boost to define relative importance.
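
A minimal sketch of how such a request body could be assembled, written as a plain Python dict so it can be sent with any HTTP client. The boost values (3, 2, 1) and the `Name.ngram` field used for the fuzzy clause are assumptions for illustration; only `Name` and `Name.raw` come from the mapping in this question. The term is passed to the prefix and term clauses unchanged.

```python
def build_autocomplete_query(term):
    """Build a bool/should body with one clause per priority level."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Priority 1: prefix match at the start of the whole name
                    {"prefix": {"Name.raw": {"value": term, "boost": 3}}},
                    # Priority 2: whole-word match anywhere in the name
                    {"term": {"Name": {"value": term, "boost": 2}}},
                    # Priority 3: fuzzy match; "Name.ngram" is a hypothetical field
                    {"match": {"Name.ngram": {"query": term, "boost": 1}}},
                ]
            }
        }
    }


body = build_autocomplete_query("harry")
```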

The issue I am having is with the Prefix query: it does not appear to lowercase the search term, even though my search analyzer has the lowercase filter. For example, the query below returns "Harry Potter" for 'harry' but zero results for 'Harry':

{ "query": { "prefix": { "Name.raw" : "Harry" } } }

I have verified using the _analyze API that both my analyzers do indeed lowercase the text "Harry" to "harry". Where am I going wrong?

From the ES documentation I understand I need to analyze the Name field in two different ways to enable use of both Prefix and Term queries:

  1. using the "keyword" tokenizer to enable the Prefix query (I have applied this on a .raw field)

  2. using a standard analyzer to enable the Term query (I have applied this on the Name field)

I have checked duplicate questions such as this one, but the answers have not helped.

My mapping and settings are below.

ES Index Mapping

{
    "myIndex": {
        "mappings": {
            "pages": {
                "properties": {
                    "Id": {},
                    "Name": {
                        "type": "text",
                        "fields": {
                            "raw": {
                                "type": "text",
                                "analyzer": "keywordAnalyzer",
                                "search_analyzer": "pageSearchAnalyzer"
                            }
                        },
                        "analyzer": "pageSearchAnalyzer"
                    },
                    "Tokens": {} // Other fields not important for this question
                }
            }
        }
    }
}

ES Index Settings

{
    "myIndex": {
        "settings": {
            "index": {
                "analysis": {
                    "filter": {
                        "ngram": {
                            "type": "edgeNGram",
                            "min_gram": "2",
                            "max_gram": "15"
                        }
                    },
                    "analyzer": {
                        "keywordAnalyzer": {
                            "filter": [
                                "trim",
                                "lowercase",
                                "asciifolding"
                            ],
                            "type": "custom",
                            "tokenizer": "keyword"
                        },
                        "pageSearchAnalyzer": {
                            "filter": [
                                "trim",
                                "lowercase",
                                "asciifolding"
                            ],
                            "type": "custom",
                            "tokenizer": "standard"
                        },
                        "pageIndexAnalyzer": {
                            "filter": [
                                "trim",
                                "lowercase",
                                "asciifolding",
                                "ngram"
                            ],
                            "type": "custom",
                            "tokenizer": "standard"
                        }
                    }
                },
                "number_of_replicas": "1",
                "uuid": "l2AXoENGRqafm42OSWWTAg",
                "version": {}
            }
        }
    }
}

2 Answers

1
votes

Prefix queries don't analyze the search term, so the text you pass in bypasses the search analyzer (in your case, the configured search_analyzer, pageSearchAnalyzer). Harry is therefore compared as-is against the keyword-tokenized, lowercased token harry potter that keywordAnalyzer produced at index time, and it does not match.
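
The mismatch can be reproduced outside Elasticsearch. The sketch below is a simplified stand-in, not Elasticsearch code: it assumes the index-time analyzer reduces to trimming and lowercasing the whole value (asciifolding is ignored), and it models the prefix query as a raw string comparison against that single token.

```python
def index_token(value):
    # Simplified keywordAnalyzer: one token for the whole value, trimmed and lowercased
    return value.strip().lower()


def prefix_matches(search_text, indexed_token):
    # A prefix query compares the search text as given; no analyzer runs on it
    return indexed_token.startswith(search_text)


token = index_token("Harry Potter")      # "harry potter"
print(prefix_matches("harry", token))    # True
print(prefix_matches("Harry", token))    # False
```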

In your case, you can do one of two things:

  1. Since you're using a lowercase filter on the field, always use lowercase terms in your prefix query (lowercasing on the application side if necessary)
  2. Run a match query against an edge_ngram-analyzed field instead of a prefix query, as described in the ES search_analyzer docs
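
For the first option, a minimal sketch of application-side normalization before building the prefix query. The helper names are hypothetical, and the accent stripping via `unicodedata` only approximates the `asciifolding` filter (characters without a Unicode decomposition, such as `ø`, are left unchanged), so treat it as a starting point.

```python
import unicodedata


def normalize_prefix(text):
    """Approximate the trim, lowercase and asciifolding filters in application code."""
    text = text.strip().lower()
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_prefix_query(user_input):
    # The prefix query is not analyzed, so the value is normalized up front
    return {"query": {"prefix": {"Name.raw": normalize_prefix(user_input)}}}


print(build_prefix_query("Harry"))  # {'query': {'prefix': {'Name.raw': 'harry'}}}
```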

Here's an example of the latter:

1) Create the index with an edge ngram index analyzer and (recommended) the standard search analyzer

PUT my_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "ngram": {
            "type": "edgeNGram",
            "min_gram": "2",
            "max_gram": "15"
          }
        },
        "analyzer": {
          "pageIndexAnalyzer": {
            "filter": [
              "trim",
              "lowercase",
              "asciifolding",
              "ngram"
            ],
            "type": "custom",
            "tokenizer": "keyword"
          }
        }
      }
    }
  },
  "mappings": {
    "pages": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "pageIndexAnalyzer",
              "search_analyzer": "standard"
            }
          }
        }
      }
    }
  }
}

2) Index some sample docs

POST my_index/pages/_bulk
{"index":{}}
{"name":"Harry Potter"}
{"index":{}}
{"name":"Hermione Granger"}

3) Run a match query against the ngram field

POST my_index/pages/_search
{
  "query": {
    "match": {
      "name.ngram": {
        "query": "Har",
        "operator": "and"
      }
    }
  }
}
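
To see why a search for "Har" matches here, the index-time token stream can be approximated in a few lines. This is an illustrative simulation rather than Elasticsearch code: it assumes the keyword tokenizer emits the whole lowercased value as one token and that the edge n-gram filter emits its prefixes of length 2 to 15.

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    # Prefixes of the token, from min_gram up to max_gram characters
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


indexed = edge_ngrams("harry potter")   # one keyword token, already lowercased
search_token = "Har".lower()            # the standard search analyzer lowercases "Har"

print(indexed[:4])              # ['ha', 'har', 'harr', 'harry']
print(search_token in indexed)  # True
print("pot" in indexed)         # False
```

Because the whole name is a single token, only prefixes of the full name are indexed. A multi-word input such as "Harry Pot" is split by the standard search analyzer into harry and pot, and pot is not among the indexed grams, so with operator "and" that input would not be expected to match in this setup.
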
0
votes

I think it is better to use a match_phrase_prefix query on the field itself, without the .keyword suffix. See the docs here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html
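
A hedged sketch of what that request body might look like, built as a Python dict. The `max_expansions` value of 10 is an arbitrary choice for illustration. Since match_phrase_prefix analyzes its input, "Harry" and "harry" should behave the same, though it matches a prefix on any word in the field, not only at the start of Name.

```python
def build_phrase_prefix_query(text, max_expansions=10):
    # match_phrase_prefix runs the field's search analyzer on the text
    return {
        "query": {
            "match_phrase_prefix": {
                "Name": {"query": text, "max_expansions": max_expansions}
            }
        }
    }


body = build_phrase_prefix_query("Harry")
```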