3
votes

I have a content field (string) indexed in Elasticsearch. It uses the default analyzer, which is the standard analyzer.

When I search with a match query:

{"query":{"match":{"content":{"query":"micro soft","operator":"and"}}}}

The result shows that it does not match "microsoft".

How can I use the input keywords "micro soft" to match documents whose content contains "microsoft"?


3 Answers

1
votes

Another solution to this is to use the nGram token filter, which would allow you to have a more "fuzzy" match.

Using your example for "microsoft" and "micro soft", here is an example of how an ngram token filter would break down the tokens:

POST /test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_ngrams": {
          "type": "ngram",
          "min_gram": "3",
          "max_gram": "5"
        }
      },
      "analyzer" : {
        "my_analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter": ["my_ngrams"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

Analyzing the two inputs gives:

curl '0:9200/test/_analyze?field=body&pretty' -d'microsoft'
{
  "tokens" : [ {
    "token" : "mic"
  }, {
    "token" : "micr"
  }, {
    "token" : "micro"
  }, {
    "token" : "icr"
  }, {
    "token" : "icro"
  }, {
    "token" : "icros"
  }, {
    "token" : "cro"
  }, {
    "token" : "cros"
  }, {
    "token" : "croso"
  }, {
    "token" : "ros"
  }, {
    "token" : "roso"
  }, {
    "token" : "rosof"
  }, {
    "token" : "oso"
  }, {
    "token" : "osof"
  }, {
    "token" : "osoft"
  }, {
    "token" : "sof"
  }, {
    "token" : "soft"
  }, {
    "token" : "oft"
  } ]
}

curl '0:9200/test/_analyze?field=body&pretty' -d'micro soft'
{
  "tokens" : [ {
    "token" : "mic"
  }, {
    "token" : "micr"
  }, {
    "token" : "micro"
  }, {
    "token" : "icr"
  }, {
    "token" : "icro"
  }, {
    "token" : "cro"
  }, {
    "token" : "sof"
  }, {
    "token" : "soft"
  }, {
    "token" : "oft"
  } ]
}

(I cut out some of the output, full output here: https://gist.github.com/dakrone/10abb4a0cfe8ce8636ad)

As you can see, since the ngram terms for "microsoft" and "micro soft" overlap, you will be able to find matches for searches like this.
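To make the overlap concrete, here is a minimal Python sketch that mimics the token lists shown above. It is only an illustration and not how Elasticsearch implements the filter: the `ngrams` and `analyze` helpers are hypothetical names, and splitting on whitespace is an assumption standing in for the standard tokenizer.

```python
def ngrams(token, min_gram=3, max_gram=5):
    """Return every substring of `token` with length min_gram..max_gram."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams


def analyze(text, min_gram=3, max_gram=5):
    """Assumption: whitespace splitting approximates the standard tokenizer."""
    grams = []
    for token in text.split():
        grams.extend(ngrams(token, min_gram, max_gram))
    return grams


if __name__ == "__main__":
    indexed = set(analyze("microsoft"))
    query = set(analyze("micro soft"))
    # Grams of the query that are missing from the indexed text.
    print(sorted(query - indexed))  # prints []
    # Grams shared by the query and the indexed text.
    print(sorted(query & indexed))
```

In this sketch every gram produced for "micro soft" also appears among the grams of "microsoft", which is why the two inputs can match each other.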

1
votes

Another approach to this problem is word decomposition. You can either use a dictionary-based approach (the Compound Word Token Filter) or a plugin that decomposes words algorithmically (the Decompound plugin).

For example, the word "microsoft" would be split into the following tokens:

{
   "tokens": [
      {
         "token": "microsoft"
      },
      {
         "token": "micro"
      },
      {
         "token": "soft"
      }
   ]
}

These tokens will allow you to search for partial words, as you asked.

Compared to the ngrams approach mentioned in the other answer, this approach gives higher precision with only slightly lower recall.
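As a rough illustration of the dictionary-based idea, the Python sketch below keeps the original token and adds every dictionary word found inside it. It is not the actual token filter or plugin: the `decompound` helper, its `min_subword_size` parameter, and the two-word dictionary are hypothetical, chosen only to reproduce the token list above.

```python
def decompound(token, word_list, min_subword_size=2):
    """Return the original token followed by each dictionary word found inside it."""
    tokens = [token]
    for start in range(len(token)):
        for end in range(start + min_subword_size, len(token) + 1):
            part = token[start:end]
            # Skip the whole token so it is not emitted twice.
            if part in word_list and part != token:
                tokens.append(part)
    return tokens


if __name__ == "__main__":
    word_list = {"micro", "soft"}  # hypothetical dictionary
    print(decompound("microsoft", word_list))  # prints ['microsoft', 'micro', 'soft']
```

Under these assumptions, the query terms "micro" and "soft" are both among the tokens emitted for "microsoft", while unrelated fragments such as "cros" are not, which is where the precision gain over ngrams comes from.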

0
votes

Try an Elasticsearch wildcard query, as below:

{
  "query": {
    "bool": {
      "must": {
        "wildcard": { "content": "micro*soft" }
      }
    }
  }
}
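A wildcard query is a term-level query, so the pattern is not analyzed and is compared against individual indexed terms. The Python sketch below approximates that behavior with the standard library's `fnmatchcase`. This is an assumption made for illustration (`fnmatchcase` also understands `[...]` sets, unlike the query), and `matching_terms` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def matching_terms(pattern, terms):
    """Return the terms that match the wildcard pattern as a whole."""
    return [term for term in terms if fnmatchcase(term, pattern)]


if __name__ == "__main__":
    # Terms as the standard analyzer might index them (assumed for illustration).
    terms = ["microsoft", "micro", "soft"]
    print(matching_terms("micro*soft", terms))  # prints ['microsoft']
```

In this sketch the pattern matches the single term "microsoft" but neither "micro" nor "soft" on its own, so a document that contains "micro soft" as two separate words would not match that pattern.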