In my Elasticsearch dataset we have unique IDs whose segments are separated by periods. A sample ID might look like c.123.5432
Using an nGram tokenizer, I'd like to be able to search for: c.123.54
This doesn't return any results. I believe the tokenizer is splitting on the periods. To account for this I added "punctuation" to token_chars, but there was no change in the results. My analyzer/tokenizer is below.
I've also tried "token_chars": [], which per the documentation should keep all characters.
"settings" : {
"index" : {
"analysis" : {
"analyzer" : {
"my_ngram_analyzer" : {
"tokenizer" : "my_ngram_tokenizer"
}
},
"tokenizer" : {
"my_ngram_tokenizer" : {
"type" : "nGram",
"min_gram" : "1",
"max_gram" : "10",
"token_chars": [ "letter", "digit", "whitespace", "punctuation", "symbol" ]
}
}
}
}
},
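A quick way to see exactly which tokens the analyzer emits (and so whether the periods are being dropped) is the _analyze API. A minimal sketch, with my_index standing in for the real index name; on newer Elasticsearch versions the analyzer and text go in a JSON request body instead of query parameters:

GET /my_index/_analyze?analyzer=my_ngram_analyzer&text=c.123.5432

If a token such as c.123.54 appears in the output, the tokenizer itself is behaving as intended.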
Edit (more info): This is the mapping of the relevant field:
"ProjectID":{"type":"string","store":"yes", "copy_to" : "meta_data"},
And this is the field I'm copying it into (which also has the nGram analyzer):
"meta_data" : { "type" : "string", "store":"yes", "index_analyzer": "my_ngram_analyzer"}
This is the query I'm running in Sense to see whether my search worked (note that it searches the "meta_data" field):
GET /_search?pretty=true
{
    "query": {
        "match": {
            "meta_data": "c.123.54"
        }
    }
}
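As an extra sanity check (not something I've run yet), a term query skips query-time analysis entirely, so if the nGram tokens really were indexed, searching for one of them verbatim should return the document even if the match query above does not:

GET /_search?pretty=true
{
    "query": {
        "term": {
            "meta_data": "c.123.54"
        }
    }
}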