From the Elasticsearch documentation:

the match_phrase query first analyzes the query string to produce a list of terms. It then searches for all the terms, but keeps only documents that contain all of the search terms, in the same positions relative to each other.

I have configured my analyzer to use an edge_ngram token filter with the keyword tokenizer:

{
        "index": {
            "number_of_shards": 1,
            "analysis": {
                "filter": {
                    "autocomplete_filter": {
                        "type": "edge_ngram",
                        "min_gram": 1,
                        "max_gram": 20
                    }
                },
                "analyzer": {
                    "autocomplete": {
                        "type": "custom",
                        "tokenizer": "keyword",
                        "filter": [
                            "lowercase",
                            "autocomplete_filter"
                        ]
                    }
                }
            }
        }
    }

Here is the Java class that is used for indexing:

@Document(indexName = "myindex", type = "program")
@Getter
@Setter
@Setting(settingPath = "/elasticsearch/settings.json")
public class Program {


    @org.springframework.data.annotation.Id
    private Long instanceId;

    @Field(analyzer = "autocomplete",searchAnalyzer = "autocomplete",type = FieldType.String )
    private String name;
}

If a document contains the phrase "hello world", the following query matches it:

{
  "match" : {
    "name" : {
      "query" : "ho",
      "type" : "phrase"
    }
  }
}
Result: "hello world"

That's not what I expect, because not all of the search terms are in the document.

My questions:

1- Shouldn't the edge_ngram/autocomplete analyzer produce 2 search terms for the query "ho"? (The terms should be "h" and "ho" respectively.)

2- Why does "ho" match "hello world" when, according to the definition of the phrase query, not all of the terms matched? (The "ho" term shouldn't have matched.)


Update:

In case the question is not clear: the match phrase query should analyze the query string, here ho, into a list of terms. Since this is edge_ngram with a min_gram of 1, we get 2 terms: h and ho. According to Elasticsearch, the document must contain all of the search terms. However, hello world contains only h and not ho, so why did I get a match here?

(1) You haven't added the index mapping, nor have you specified the type of the name field. (2) You haven't provided a sample document, so we don't know what data you are trying to match against. Clarify these points so that people here can help you better. - Nishant
@NishantSaini updated the question. - Mohammad Karmi

3 Answers

  1. If you could provide complete, runnable examples for your problems it would make it much easier to help you. For example something like this:

    PUT test
    {
      "settings": {
        "number_of_shards": 1,
        "analysis": {
          "filter": {
            "autocomplete_filter": {
              "type": "edge_ngram",
              "min_gram": 1,
              "max_gram": 20
            }
          },
          "analyzer": {
            "autocomplete": {
              "type": "custom",
              "tokenizer": "keyword",
              "filter": [
                "lowercase",
                "autocomplete_filter"
              ]
            }
          }
        }
      },
      "mappings": {
        "_doc": {
          "properties": {
            "name": {
              "type": "text",
              "analyzer": "autocomplete"
            }
          }
        }
      }
    }
    
    PUT test/_doc/1
    {
      "name": "Hello world"
    }
    
    GET test/_search
    {
      "query": {
        "match_phrase": {
          "name": "hello foo"
        }
      }
    }
    
  2. Judging from your search query, you are using Elasticsearch 2.x or earlier. This is a dead version, so you should really upgrade.

  3. I'm not sure that combining phrase search with edge n-grams makes much sense. What are you trying to achieve here?
  4. Why is it matching? Your search query runs through the same analyzer as your stored field. Since you have defined min_gram: 1, your ho is searched as h and ho, and the h matches the h from hello. With this analyzer, match and match_phrase make no difference.
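
To make point 4 concrete, here is a minimal Python sketch that imitates what the analyzer does. It is not Elasticsearch code: the function names `edge_ngrams` and `autocomplete_terms` are made up for illustration, and it only assumes the keyword tokenizer, lowercase filter, and edge_ngram settings (min_gram 1, max_gram 20) from the question.

```python
def edge_ngrams(text, min_gram=1, max_gram=20):
    """Return the prefixes of `text` that are min_gram..max_gram characters long."""
    upper = min(len(text), max_gram)
    return [text[:n] for n in range(min_gram, upper + 1)]


def autocomplete_terms(text):
    # Keyword tokenizer: the whole input stays a single token.
    # Then lowercase, then the edge n-gram filter (hypothetical stand-in).
    return edge_ngrams(text.lower())


query_terms = autocomplete_terms("ho")
doc_terms = autocomplete_terms("Hello world")
shared = sorted(set(query_terms) & set(doc_terms))

print(query_terms)  # ['h', 'ho']
print(shared)       # ['h']
```

The query and the document share the term h, which is enough for a hit once the two query terms are treated as alternatives.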

If I understand your question, the tokenizer is the problem: with "tokenizer": "keyword", the whole phrase is indexed as a single token and searched as one exact token.

Structured Text Tokenizers


I got the answer from the Elasticsearch forum:

You are using the edge_ngram token filter. Let's see how your analyzer treats your query string "ho". Assuming your index is called my_index:

GET my_index/_analyze
{
  "text": "ho",
  "analyzer": "autocomplete"
}

The response shows that your analyzer outputs two tokens, both at position 0:

{
  "tokens": [
    {
      "token": "h",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "ho",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    }
  ]
}

What does Elasticsearch do with a query for two tokens at the same position? It treats the query as an "OR", even if you use the type "phrase". You can see that from the output of the validate API (which shows you the Lucene query that your query was rewritten into):

GET my_index/_validate/query?rewrite=true
{
  "query": {
    "match": {
      "name": {
        "query": "ho",
        "type": "phrase"
      }
    }
  }
}

Because both your query and your document have an h at position 0, the document is going to be a hit.
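
The behaviour can be imitated with a small Python sketch. This is a simplified model, not Lucene's actual implementation: `filter_analyze`, `group_by_position`, and `phrase_matches` are hypothetical helpers, and the only assumption taken from above is that terms sharing a query position act as alternatives.

```python
from collections import defaultdict


def filter_analyze(text, min_gram=1, max_gram=20):
    """Keyword tokenizer + lowercase + edge_ngram filter: every gram sits at position 0."""
    token = text.lower()
    upper = min(len(token), max_gram)
    return [(token[:n], 0) for n in range(min_gram, upper + 1)]


def group_by_position(tokens):
    """Map each position to the set of terms found there."""
    groups = defaultdict(set)
    for term, position in tokens:
        groups[position].add(term)
    return groups


def phrase_matches(query_tokens, doc_tokens):
    """Simplified phrase match: terms at the same query position are ORed together."""
    query = group_by_position(query_tokens)
    doc = group_by_position(doc_tokens)
    if not query:
        return False
    first = min(query)
    for start in doc:
        # Every query position needs at least one alternative at the aligned doc position.
        if all(query[p] & doc.get(start + p - first, set()) for p in query):
            return True
    return False


print(phrase_matches(filter_analyze("ho"), filter_analyze("hello world")))  # True
print(phrase_matches(filter_analyze("xo"), filter_analyze("hello world")))  # False
```

With all grams stacked at position 0, the phrase check degenerates into "does any query gram occur in the document", which is why "ho" hits but "xo" does not.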

Now, how to solve this? Instead of the edge_ngram token filter, you could use the edge_ngram tokenizer. This tokenizer increments the position of every token it outputs.

So, if you create your index like this instead:

PUT my_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "autocomplete_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "autocomplete"
        }
      }
    }
  }
}

You will see that this query is no longer a hit:

GET my_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "ho",
        "type": "phrase"
      }
    }
  }
}

But for example this one is:

GET my_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "he",
        "type": "phrase"
      }
    }
  }
}
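
To see why the two queries behave differently with the tokenizer, here is a rough Python sketch of the same idea. It is only a model under stated assumptions: `tokenizer_analyze` and `phrase_matches` are hypothetical helpers, the tokenizer is assumed to keep all characters (no token_chars configured, as in the settings above), and each gram is assumed to get the next position, as described above.

```python
def tokenizer_analyze(text, min_gram=1, max_gram=20):
    """edge_ngram tokenizer + lowercase: each gram gets its own, increasing position."""
    lowered = text.lower()
    upper = min(len(lowered), max_gram)
    return [(lowered[:n], n - min_gram) for n in range(min_gram, upper + 1)]


def phrase_matches(query_tokens, doc_tokens):
    """Simplified phrase match: all query terms must occur at the same relative positions."""
    doc_positions = {}
    for term, pos in doc_tokens:
        doc_positions.setdefault(term, set()).add(pos)
    if not query_tokens:
        return False
    first_term, first_pos = query_tokens[0]
    for start in doc_positions.get(first_term, ()):
        if all(start + (pos - first_pos) in doc_positions.get(term, ())
               for term, pos in query_tokens):
            return True
    return False


doc = tokenizer_analyze("hello world")
print(tokenizer_analyze("ho"))                   # [('h', 0), ('ho', 1)]
print(phrase_matches(tokenizer_analyze("ho"), doc))  # False
print(phrase_matches(tokenizer_analyze("he"), doc))  # True
```

Because h and ho now sit at consecutive positions, the phrase needs both terms in the document, and "hello world" never produces the gram ho.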