Can't deal with accents in Elasticsearch indexing and search

Question

I have an issue with Elasticsearch and the way data is indexed and retrieved, and I don't understand what is happening.

The idea is simple, in theory: I have a string analyzer with the lowercase and asciifolding filters, because I don't want to care about case or accents, and I would like to use this analyzer for both indexing and searching. This is the mapping I use (sorry, it's in YAML format):

settings:
    index:
        analysis:
            filter:
                autocomplete_filter:
                    type: edgeNGram
                    side: front
                    min_gram: 1
                    max_gram: 20
            analyzer:
                autocomplete:
                    type: custom
                    tokenizer: standard
                    filter: [lowercase, asciifolding, autocomplete_filter]
                string_analyzer:
                    type:        custom
                    tokenizer:   standard
                    filter:      [lowercase, asciifolding]
types:
    city:
        mappings:
            cityName:
                type: string
                analyzer: string_analyzer
                search_analyzer: string_analyzer
            location: {type: geo_point}

When I run this query:

{
    "query": {
        "prefix": {
            "cityName": "per"
        }
    },
    "size": 20
}

I get results such as "Perpezat", "Pern", and "Péreuil", which is the expected result.

But if I run the following query:

{
    "query": {
        "prefix": {
            "cityName": "pér"
        }
    },
    "size": 20
}

Then I get no result at all.

If you have any clue, I would be happy to hear it. Thanks.

Mario Trucco · Accepted Answer · 2017-06-04T15:30:11

Unlike most other queries, the Prefix Query does not analyze your search input:

Matches documents that have fields containing terms with a specified prefix (not analyzed)

Your first example works because the documents are analyzed at index time using your analyzer with lowercase and asciifolding, so they contain a term starting with per (perpezat, pern, pereuil).

Your second example does not work because the prefix pér is compared as-is against the indexed terms, and none of them start with pér (the accent was folded away at index time).
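To make the mismatch concrete, here is a small Python sketch that imitates the effect locally, without touching Elasticsearch. The fold function is only a rough stand-in for the lowercase and asciifolding filters (it strips combining marks after Unicode decomposition, which covers é but not everything the real filter handles), and prefix_matches is a hypothetical helper that compares the raw prefix against the folded terms, much as the prefix query does.

```python
import unicodedata


def fold(term):
    """Rough stand-in for lowercase + asciifolding (an approximation only)."""
    decomposed = unicodedata.normalize("NFKD", term.lower())
    # Drop combining marks such as the acute accent left over from "é".
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Terms roughly as they would be stored in the index for the three cities.
indexed_terms = [fold(name) for name in ["Perpezat", "Pern", "Péreuil"]]


def prefix_matches(prefix):
    """Compare the raw, unanalyzed prefix against the indexed terms."""
    return [term for term in indexed_terms if term.startswith(prefix)]


print(prefix_matches("per"))  # ['perpezat', 'pern', 'pereuil']
print(prefix_matches("pér"))  # []
```

Passing fold("pér") instead of the raw input gives the same matches as "per", which is the idea behind analyzing the input before building the query.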

Since I couldn't find a way to tell Elasticsearch to analyze the prefix before performing the search, you could achieve your goal by manually adding this step:

1. Ask Elasticsearch to analyze your input by calling the Analyze API.
2. Use the output from step 1 (it should be per in the examples) for the prefix query.

For this to work, your search input should be a single term (I think that could be why Elasticsearch doesn't analyze it in the first place).
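The two steps could be sketched in Python roughly as follows. This only builds the request bodies and does not talk to a cluster; sending them is left to whatever HTTP or Elasticsearch client you already use. The index name "cities", the helper names, and the hand-written sample response are assumptions for illustration. The sketch also assumes an Elasticsearch version whose Analyze API accepts a JSON body with "analyzer" and "text" (older releases take them as query-string parameters instead).

```python
import json

INDEX = "cities"              # hypothetical index name, not given in the question
ANALYZER = "string_analyzer"  # analyzer name taken from the question's settings


def analyze_request(text):
    """Step 1: body to send to /<INDEX>/_analyze."""
    return {"analyzer": ANALYZER, "text": text}


def prefix_query_from(analyze_response, field="cityName", size=20):
    """Step 2: build the prefix query from the tokens the Analyze API returned."""
    tokens = [t["token"] for t in analyze_response.get("tokens", [])]
    # A prefix query takes a single term, so reject multi-word input.
    if len(tokens) != 1:
        raise ValueError("expected exactly one term, got %r" % (tokens,))
    return {"query": {"prefix": {field: tokens[0]}}, "size": size}


# Response shape for the input "pér", written by hand here rather than
# captured from a real cluster.
sample_response = {
    "tokens": [
        {"token": "per", "start_offset": 0, "end_offset": 3,
         "type": "<ALPHANUM>", "position": 0}
    ]
}

print(json.dumps(analyze_request("pér"), ensure_ascii=False))
print(json.dumps(prefix_query_from(sample_response)))
```

Raising on anything other than exactly one token keeps the single-term restriction explicit instead of silently using only the first word.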
