2
votes

I have a problem with Elasticsearch: a business requirement needs to support searching with special characters. For example, some of the query strings might contain a space, @, &, ^, (), or !. I have some similar use cases below.

  1. foo&bar123 (an exact match)
  2. foo & bar123 (whitespace around the special char)
  3. foobar123 (no special chars)
  4. foobar 123 (no special chars, with whitespace)
  5. foo bar 123 (no special chars, with whitespace between words)
  6. FOO&BAR123 (upper case)

All of them should return the same results. Can anyone give me some help with this? Note that right now I can search other strings with no special characters perfectly. Here is my index definition:

{
    "settings": {
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "autocomplete": {
                    "tokenizer": "custom_tokenizer"
                }
            },
            "tokenizer": {
                "custom_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 30,
                    "token_chars": [
                        "letter",
                        "digit"
                    ]
                }
            }
        }
    },
    "mappings": {
        "index": {
            "properties": {
                "some_field": {
                    "type": "text",
                    "analyzer": "autocomplete"
                },
                "some_field_2": {
                    "type": "text",
                    "analyzer": "autocomplete"
                }
            }
        }
    }
}

1 Answer

6
votes

EDIT:

There are two things to check here:

(1) Is the special character being analysed when we index the document?

The _analyze API tells us no:

POST localhost:9200/index-name/_analyze
{
    "analyzer": "autocomplete",
    "text": "foo&bar"
}

// returns
fo, foo, foob, fooba, foobar, oo, oob, // ...etc: the & has been ignored

This is because of the "token_chars" in your mapping: "letter" and "digit". These two groups do not include punctuation such as '&'. Hence, when you upload "foo&bar" to the index, the & is actually ignored.

To include the & in the index, you want to add "punctuation" to your "token_chars" list. You may also want the "symbol" group for some of your other characters:

"tokenizer": {
    "custom_tokenizer": {
        "type": "ngram",
            "min_gram": 2,
            "max_gram": 30,
            "token_chars": [
                "letter",
                "digit",
                "symbol",
                "punctuation"
              ]
     }
}

Now we see the terms being analyzed appropriately:

POST localhost:9200/index-name/_analyze
{
    "analyzer": "autocomplete",
    "text": "foo&bar"
}

// returns
fo, foo, foo&, foo&b, foo&ba, foo&bar, oo, oo&, // ...etc
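
For reference, here is a sketch of the full updated "settings" block from the question with the new "token_chars" in place (mappings omitted; they stay as in the question). Note that I have also added a "lowercase" filter to the autocomplete analyzer (my own assumption, not part of the original mapping) so that upper-case input like FOO&BAR123 produces the same tokens as foo&bar123. The index has to be recreated (and the documents reindexed) for these settings to take effect:

// sketch: recreate the index with the new tokenizer; the "lowercase" filter is my addition (assumption)
PUT localhost:9200/index-name
{
    "settings": {
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "custom_tokenizer",
                    "filter": ["lowercase"]
                }
            },
            "tokenizer": {
                "custom_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 30,
                    "token_chars": ["letter", "digit", "symbol", "punctuation"]
                }
            }
        }
    }
}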

(2) Is my search query doing what I expect?

Now that we know the 'foo&bar' document is being indexed (analyzed) correctly, we need to check that the search returns the result. The following query works:

POST localhost:9200/index-name/_doc/_search
{
    "query": {
        "match": { "some_field": "foo&bar" }
    }
}

As does the GET query http://localhost:9200/index-name/_search?q=foo%26bar (where %26 is the URL-encoded &).

Other queries may have unexpected results. According to the docs, you probably want to declare your search_analyzer to be different from your index analyzer (e.g. an ngram analyzer at index time and the standard analyzer at search time), but this is up to you.
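
For example, here is a sketch of what that could look like in the question's mapping (the "standard" search analyzer here is my suggestion, not something from the question; the field and analyzer names are taken from the question):

"mappings": {
    "index": {
        "properties": {
            "some_field": {
                "type": "text",
                "analyzer": "autocomplete",
                "search_analyzer": "standard"
            }
        }
    }
}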