2
votes

I have a problem with Elasticsearch: a business requirement needs to support searching with special characters. For example, some of the query strings might contain a space, @, &, ^, (), or !. I have some similar use cases below.

  1. foo&bar123 (an exact match)
  2. foo & bar123 (whitespace around the special char)
  3. foobar123 (no special chars)
  4. foobar 123 (no special chars, with whitespace)
  5. foo bar 123 (no special chars, with whitespace between words)
  6. FOO&BAR123 (upper case)

All of them should return the same results. Can anyone give me some help with this? Note that right now I can search other strings with no special characters perfectly. Here is my index definition:

{
    "settings": {
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "autocomplete": {
                    "tokenizer": "custom_tokenizer"
                }
            },
            "tokenizer": {
                "custom_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 30,
                    "token_chars": [
                        "letter",
                        "digit"
                    ]
                }
            }
        }
    },
    "mappings": {
        "index": {
            "properties": {
                "some_field": {
                    "type": "text",
                    "analyzer": "autocomplete"
                },
                "some_field_2": {
                    "type": "text",
                    "analyzer": "autocomplete"
                }
            }
        }
    }
}

1 Answer

6
votes

EDIT:

There are two things to check here:

(1) Is the special character being analysed when we index the document?

The _analyze API tells us no:

POST localhost:9200/index-name/_analyze
{
    "analyzer": "autocomplete",
    "text": "foo&bar"
}

// returns
fo, foo, foob, fooba, foobar, oo, oob, // ...etc: the & has been ignored

This is because of the "token_chars" in your mapping: "letter" and "digit". These two groups do not include punctuation such as '&'. Hence, when you upload "foo&bar" to the index, the & is actually ignored.

To include the & in the index, you want to add "punctuation" to your "token_chars" list. You may also want the "symbol" group for some of your other characters:

"tokenizer": {
    "custom_tokenizer": {
        "type": "ngram",
            "min_gram": 2,
            "max_gram": 30,
            "token_chars": [
                "letter",
                "digit",
                "symbol",
                "punctuation"
              ]
     }
}

Now we see the terms being analyzed appropriately:

POST localhost:9200/index-name/_analyze
{
    "analyzer": "autocomplete",
    "text": "foo&bar"
}

// returns
fo, foo, foo&, foo&b, foo&ba, foo&bar, oo, oo&, // ...etc
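
For reference, here is a sketch of the full updated "settings" block from the question with the new "token_chars" in place (mappings omitted; they stay as in the question). Note that I have also added a "lowercase" filter to the autocomplete analyzer (my own assumption, not part of the original mapping) so that upper-case input like FOO&BAR123 produces the same tokens as foo&bar123. The index has to be recreated (and the documents reindexed) for these settings to take effect:

// sketch: recreate the index with the new tokenizer; the "lowercase" filter is my addition (assumption)
PUT localhost:9200/index-name
{
    "settings": {
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "custom_tokenizer",
                    "filter": ["lowercase"]
                }
            },
            "tokenizer": {
                "custom_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 30,
                    "token_chars": ["letter", "digit", "symbol", "punctuation"]
                }
            }
        }
    }
}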

(2) Is my search query doing what I expect?

Now that we know the 'foo&bar' document is being indexed (analyzed) correctly, we need to check that the search returns the result. The following query works:

POST localhost:9200/index-name/_doc/_search
{
    "query": {
        "match": { "some_field": "foo&bar" }
    }
}

As does the GET query http://localhost:9200/index-name/_search?q=foo%26bar (where %26 is the URL-encoded &).

Other queries may have unexpected results. According to the docs, you probably want to declare your search_analyzer to be different from your index analyzer (e.g. an ngram analyzer at index time and the standard analyzer at search time), but this is up to you.
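
For example, here is a sketch of what that could look like in the question's mapping (the "standard" search analyzer here is my suggestion, not something from the question; the field and analyzer names are taken from the question):

"mappings": {
    "index": {
        "properties": {
            "some_field": {
                "type": "text",
                "analyzer": "autocomplete",
                "search_analyzer": "standard"
            }
        }
    }
}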