I am attempting to create an Elasticsearch query that performs partial and full-text matching on two fields in my index, name and type, and returns all matches that also have a specific uid value. As an example, I have the following records:

  • { "name": "Doug", "type": "Large" }
  • { "name": "Doug Small", "type": "Large" }
  • { "name": "Smal", "type": "Medium" }
  • { "name": "Peter", "type": "Small" }

I would like my query to match and return all of these records. Here is my query so far:

{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "fields": [
              "name",
              "type"
            ],
            "query": "*Doug Small*~",
            "default_operator": "AND"
          }
        }
      ],
      "filter": [
        {
          "match": {
            "uid": "123"
          }
        }
      ]
    }
  }
}

To get any results at all, I had to wrap the query in * and add the fuzzy operator ~ at the end. Is this the right type of query for this use case?

Here is my mapping:

{
  "test": {
    "mappings": {
      "data": {
        "properties": {
          "uid": {
            "type": "keyword"
          },
          "name": {
            "type": "keyword"
          },
          "type": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
Comments:

  • Please share your mapping as well. (Alkis Kalogeris)
  • @AlkisKalogeris The mapping has been added. Thanks. (medium)

1 Answer

There are multiple problems to consider here.

  1. Do not use query_string unless you know exactly what you are doing, and be especially careful if the input comes from a user: query_string has a strict syntax and returns an error for invalid input. Prefer simple_query_string instead.
  2. I doubt that you want name to be of type keyword. A keyword field is not analyzed (not lowercased, not tokenized, etc.), so it only matches when the search term equals the entire stored value. You might expect that searching for Doug Small would at least return the document whose name is exactly Doug Small, but it does not. The reason is that the input to query_string or simple_query_string is parsed and, as a consequence, split into the separate terms Doug and Small, neither of which equals the stored value. To search for it as a single term you need to wrap it in double quotes ("\"Doug Small\""), but then you lose all the other matches.
  3. I believe what you need is for name and type to be of type text. The stored string is then analyzed (tokenized, lowercased, etc.) by the standard analyzer, which is the default if you don't specify another.
  4. You have set default_operator to AND for query_string. This means that every term in the query must match in either name or type. But you state that you want all of the documents returned, and only one document contains both Doug and Small. If you want the others as well, change the operator to OR (which is the default).
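
To make points 2 to 4 concrete, here is a small Python sketch that imitates, very roughly, how the four example records match under keyword versus text fields and under AND versus OR. It is a toy model, not Elasticsearch: the function names (analyze_text, search) are made up for illustration, and the tokenizer is a simplified stand-in for the real analyzer.

```python
import re

# The four example records from the question.
DOCS = [
    {"name": "Doug", "type": "Large"},
    {"name": "Doug Small", "type": "Large"},
    {"name": "Smal", "type": "Medium"},
    {"name": "Peter", "type": "Small"},
]


def analyze_text(value):
    """Toy stand-in for a text field: split into words and lowercase them."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", value)]


def search(query, field_type="text", operator="OR"):
    """Return the names of the records matched by the query (toy model)."""
    if field_type == "keyword":
        # keyword: the stored value is a single, unchanged term,
        # while the query parser still splits the input on whitespace.
        analyze_field = lambda value: [value]
        terms = query.split()
    else:
        # text: stored values and query input are both tokenized and lowercased.
        analyze_field = analyze_text
        terms = analyze_text(query)

    combine = all if operator == "AND" else any
    hits = []
    for doc in DOCS:
        indexed = set(analyze_field(doc["name"])) | set(analyze_field(doc["type"]))
        if combine(term in indexed for term in terms):
            hits.append(doc["name"])
    return hits


print(search("Doug Small", "keyword", "AND"))  # keyword fields, AND
print(search("Doug Small", "text", "AND"))     # text fields, AND
print(search("doug small", "text", "OR"))      # text fields, OR, lowercase input
```

In this model the keyword/AND combination finds nothing, text/AND finds only the Doug Small record, and text/OR finds the three records that contain Doug or Small, regardless of the case of the input.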

A complete example

PUT test
{
  "mappings": {
    "properties": {
      "uid": {
        "type": "keyword"
      },
      "name": {
        "type": "text"
      },
      "type": {
        "type": "text"
      }
    }
  }
}
POST test/_bulk
{ "index" : { "_id" : "1" } }
{ "name": "Doug", "type": "Large"}
{ "index" : { "_id" : "2" } }
{ "name": "Doug Small", "type":"Large"}
{ "index" : { "_id" : "3" } }
{ "name": "Smal", "type": "Medium"}
{ "index" : { "_id" : "4" } }
{ "name": "Peter", "type": "Small"}
GET test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "fields": [
              "name",
              "type"
            ],
            "query": "*Doug Small*",
            "default_operator": "OR"
          }
        }
      ]
    }
  }
}

The above query now returns the three documents that contain Doug, Small, or both. It is also case-insensitive (since the fields are now analyzed), so *doug small* yields the same 3 results.

Since the fields are now analyzed, you don't need the wildcard symbols. As written, they only apply to the first and the last token, meaning:

  • *Doug Small*: match anything that has <ANYTHING>Doug OR Small<ANYTHING>
  • *Doug Smith Small*: match anything that has <ANYTHING>Doug OR Smith OR Small<ANYTHING> (OR is the default operator; if you keep AND, it changes accordingly)

So let's remove the wildcards as well:

GET test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "fields": [
              "name",
              "type"
            ],
            "query": "Doug Small",
            "default_operator": "OR"
          }
        }
      ]
    }
  }
}

This yields exactly the same 3 results. You are still missing Smal, so you need to add fuzzy matching to include it as well:

GET test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "fields": [
              "name",
              "type"
            ],
            "query": "Doug Small~",
            "default_operator": "OR"
          }
        }
      ]
    }
  }
}

The query Doug Small~ means: return everything that contains Doug OR Small, where Small does not have to be an exact match.
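
The queries in this answer leave out the uid filter from the question. One way the two might be combined is sketched below in Python, building the request body as a plain dictionary. build_search_body is a hypothetical helper, not part of any Elasticsearch client, and the sketch assumes uid stays a keyword field so that an exact-match term filter is appropriate. The sample documents indexed above carry no uid, so they would need one before this filter could match them.

```python
import json


def build_search_body(text, uid, fields=("name", "type"), operator="OR"):
    """Build a bool query: a scored text match plus a non-scoring uid filter."""
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "simple_query_string": {
                            "fields": list(fields),
                            "query": text,
                            "default_operator": operator,
                        }
                    }
                ],
                # term is an exact-match query, which suits a keyword field.
                "filter": [{"term": {"uid": uid}}],
            }
        }
    }


body = build_search_body("Doug Small~", "123")
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after GET test/_search. Keeping the uid condition in filter rather than must means it restricts the results without affecting their relevance score.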

You can also apply fuzzy matching to every term:

GET test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "fields": [
              "name",
              "type"
            ],
            "query": "Dg~ Small~",
            "default_operator": "OR"
          }
        }
      ]
    }
  }
}

Dg matches Doug because of the fuzziness level (https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#fuzziness), which the documentation describes as:

The maximum allowed Levenshtein Edit Distance (or number of edits)
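
To get a feel for what an edit distance is, the sketch below computes the plain Levenshtein distance in Python. It is only an illustration of the concept: the levenshtein function is written here for the example, and Elasticsearch's own fuzzy matching may differ in details such as how transposed characters are counted.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


# Terms are lowercased here because the text fields are analyzed.
for query_term, indexed_term in [("dg", "doug"), ("small", "smal"), ("dg", "peter")]:
    print(query_term, indexed_term, levenshtein(query_term, indexed_term))
```

Here dg is two edits away from doug (two inserted letters) and small is one edit away from smal, while dg and peter are much further apart, which is why a small fuzziness level brings in Doug and Smal but not unrelated names.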