2 votes

The mapping char filter section of the Elasticsearch reference is kind of vague, and I'm having a lot of difficulty understanding if and how to use a mapping char_filter in an analyzer: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html

Basically, the data we store in the index are IDs of type string that look like this: "008392342000". I want to be able to find such IDs even when the query term contains a hyphen or a trailing space, like this: "008392342-000 ".

How would you advise I set up the analyzer? Currently this is the definition of the field:

"mappings": {
    "client": {
        "properties": {
            "ucn": {
                "type": "multi_field",
                "fields": {
                    "ucn_autoc": {
                        "type": "string",
                        "index": "analyzed",
                        "index_analyzer": "autocomplete_index",
                        "search_analyzer": "autocomplete_search"
                    },
                    "ucn": {
                        "type": "string",
                        "index": "not_analyzed"
                    }
                }
            }
        }
    }
}

Here are the settings for the index, containing the analyzers etc.:

 "settings": {
        "analysis": {
            "filter": {
                "autocomplete_ngram": {
                    "max_gram": 15,
                    "min_gram": 1,
                    "type": "edge_ngram"
                },
                "ngram_filter": {
                    "type": "nGram",
                    "min_gram": 2,
                    "max_gram": 8
                }
            },
            "analyzer": {
                "lowercase_analyzer": {
                    "filter": [
                        "lowercase"
                    ],
                    "tokenizer": "keyword"
                },
                "autocomplete_index": {
                    "filter": [
                        "lowercase",
                        "autocomplete_ngram"
                    ],
                    "tokenizer": "keyword"
                },
                "ngram_index": {
                    "filter": [
                        "ngram_filter",
                        "lowercase"
                    ],
                    "tokenizer": "keyword"
                },
                "autocomplete_search": {
                    "filter": [
                        "lowercase"
                    ],
                    "tokenizer": "keyword"
                },
                "ngram_search": {
                    "filter": [
                        "lowercase"
                    ],
                    "tokenizer": "keyword"
                }
            }
        },
        "index": {
            "number_of_shards": 6,
            "number_of_replicas": 1
        }
    }
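For reference, here is what the current chain does with a hyphenated term; a quick check, assuming the index is named my_index:

GET /my_index/_analyze?analyzer=autocomplete_search&text=008392342-000

With the keyword tokenizer and only a lowercase filter, this comes back as the single token "008392342-000", hyphen intact, so it never matches the stored "008392342000".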
Comment: Can you paste a sample document? – Vineeth Mohan

1 Answer

4 votes

You haven't said exactly what your analyzers do, what data goes in, or what your expectations are, but based on the info you provided I would start with this:

{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": [
            "-=>"
          ]
        }
      },
      "analyzer": {
        "autocomplete_search": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_mapping"
          ],
          "filter": [
            "trim"
          ]
        },
        "autocomplete_index": {
          "tokenizer": "keyword",
          "filter": [
            "trim"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "ucn": {
          "type": "multi_field",
          "fields": {
            "ucn_autoc": {
              "type": "string",
              "index": "analyzed",
              "index_analyzer": "autocomplete_index",
              "search_analyzer": "autocomplete_search"
            },
            "ucn": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

The char_filter replaces - with nothing (the "-=>" mapping has an empty right-hand side). I would also use the trim filter to get rid of any trailing or leading whitespace. Since I don't know what your autocomplete_index analyzer does, I just used a keyword tokenizer.
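If it helps, this is how I'd spin up a quick test index with the settings above; a sketch, assuming you've saved the JSON above as settings.json and Elasticsearch runs on localhost:9200:

# create a throwaway index named my_index with the settings/mappings above
curl -XPUT 'http://localhost:9200/my_index' -d @settings.json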

Testing the analyzer with GET /my_index/_analyze?analyzer=autocomplete_search&text= 0123-34742-000 results in:

"tokens": [
      {
         "token": "012334742000",
         "start_offset": 0,
         "end_offset": 17,
         "type": "word",
         "position": 1
      }
   ]

which means it does eliminate the - and the whitespace. And a typical query would be:

{
  "query": {
    "match": {
      "ucn.ucn_autoc": " 0123-34742-000  "
    }
  }
}
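And, for completeness, the same query over HTTP; again assuming the index is named my_index:

curl -XGET 'http://localhost:9200/my_index/_search' -d '{
  "query": {
    "match": {
      "ucn.ucn_autoc": " 0123-34742-000  "
    }
  }
}'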