6 votes

I have documents that I want to index in ElasticSearch that contain a text field called name. I currently index the name using the snowball analyzer. However, I would like to match names both with and without included spaces. For example, a document with the name "The Home Depot" should match "homedepot", "home", and "home depot". Additionally, documents with a single-word name like "ExxonMobil" should match "exxon mobil" and "exxonmobil".

I can't seem to find the right combination of analyzer/filters to accomplish this.


2 Answers

5 votes

I think the most direct approach to this problem would be to apply a Shingle token filter, which, instead of creating ngrams of characters, creates combinations of incoming tokens. You can add it to your analyzer like this:

filter:
    ........
    my_shingle_filter:
        type: shingle
        min_shingle_size: 2      # combine at least 2 adjacent tokens
        max_shingle_size: 3      # ...and at most 3
        output_unigrams: true    # also emit the original single tokens
        token_separator: ""      # join with no space, so "home depot" becomes "homedepot"

You should be mindful of where this filter is placed in your filter chain. It should probably come late in the chain, after all token separation/removal/replacement has already occurred (i.e., after any StopFilters, SynonymFilters, stemmers, etc.).
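For context, here is a minimal sketch of what the complete index settings might look like in JSON form, with the shingle filter placed last in the chain as described above. The index name companies, the analyzer name name_analyzer, and the inclusion of the snowball filter are illustrative assumptions, not part of the original answer:

    PUT /companies
    {
        "settings": {
            "analysis": {
                "filter": {
                    "my_shingle_filter": {
                        "type": "shingle",
                        "min_shingle_size": 2,
                        "max_shingle_size": 3,
                        "output_unigrams": true,
                        "token_separator": ""
                    }
                },
                "analyzer": {
                    "name_analyzer": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "snowball", "my_shingle_filter"]
                    }
                }
            }
        }
    }

You can check the output with the _analyze API:

    GET /companies/_analyze
    {
        "analyzer": "name_analyzer",
        "text": "The Home Depot"
    }

which should produce tokens along the lines of the, thehome, thehomedepot, home, homedepot, depot, so queries for "homedepot", "home", and "home depot" can all match.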

-3 votes

In this case, you might need to look at an ngram type solution.

Ngram does something like this:

Given the text abcd, analyzed with an ngram filter, you might get the tokens:

a
ab
abc
abcd
b
bc
bcd
c
cd
d

Below is a setting that might work for you.

You might need to tinker with the filter portion. This particular filter creates grams at least 2 and at most 12 characters long.

Now, if you need the further analysis that snowball gives you (like water, waters, and watering all matching the token water), you will need to tinker further.

        "filter": {
            "ngram_filter": {
                "type": "nGram",
                "min_gram": 2,
                "max_gram": 12
            }
        },
        "analyzer": {
            "ngram_index": {
                "filter": [
                    "lowercase",
                    "ngram_filter"
                ],
                "tokenizer": "keyword"
            },
            "ngram_search": {
                "filter": [
                    "lowercase"
                ],
                "tokenizer": "keyword"
            }
        }
    },
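To sanity-check the index-time tokens, you can run the analyzer through the _analyze API. The index name companies is an illustrative assumption; note also that newer Elasticsearch versions spell the filter type ngram rather than nGram:

    GET /companies/_analyze
    {
        "analyzer": "ngram_index",
        "text": "ExxonMobil"
    }

Because the keyword tokenizer emits the whole string as a single token, the lowercase filter turns it into exxonmobil, and the ngram filter then produces every 2- to 12-character substring, including exxon and mobil.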

The idea here is that at index time you want to create the right tokens so that they are available at search time. At search time, all you need to do is match against those tokens; you don't need to reapply the ngram analyzer.
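As a minimal sketch of how to wire this up, you would point the field's index-time and search-time analysis at the two analyzer definitions above in the mapping. The index and field names are illustrative, and this uses the mapping syntax of recent Elasticsearch versions (older versions used index_analyzer instead of analyzer):

    PUT /companies
    {
        "mappings": {
            "properties": {
                "name": {
                    "type": "text",
                    "analyzer": "ngram_index",
                    "search_analyzer": "ngram_search"
                }
            }
        }
    }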

EDIT:

One last thing I just noticed is this requirement: "ExxonMobil" should match "exxon mobil".

That probably means you will need to do something like this:

            "ngram_search": {
                "filter": [
                    "lowercase"
                ],
                "tokenizer": "whitespace"

            }

Note the use of the "whitespace" tokenizer instead of keyword; this allows the search to split the query on whitespace.
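With that in place, a plain match query should cover the original requirement. This is a sketch under the same assumptions as above; the index and field names are illustrative:

    GET /companies/_search
    {
        "query": {
            "match": {
                "name": "exxon mobil"
            }
        }
    }

The whitespace tokenizer splits the query into exxon and mobil, both of which exist among the 2- to 12-character grams indexed for "ExxonMobil", so the document matches.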