Elasticsearch search for Turkish characters

Question

I have some documents that I am indexing with Elasticsearch. Some of them are written in upper case, with the Turkish characters replaced by their ASCII equivalents. For example, "kürşat" is written as "KURSAT".

I want to find this document by searching for "kürşat". How can I do that?

Thanks

Going the other way around (kürşat->KURSAT) would be easy, but inferring that U should be ü is hard, since U could also be a plain u, which is also valid in Turkish. The same goes for S. I guess you would need to look the word up in a dictionary somehow. — Val
That is exactly the problem. It is easy to convert every "U" to "ü", but it is hard to tell which "u" is a real "u" and which should be "ü". I want to retrieve both "kursat" and "kürşat" when I search for "kürşat". — Kursat Serolar

Byron Voorbach · Accepted Answer · 2017-02-28T14:47:49

Take a look at the asciifolding token filter.

Here is a small example for you to try out in Sense:

Index:

DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "analyzer": {
        "turkish_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_ascii_folding"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "turkish_analyzer"
        }
      }
    }
  }
}

POST test/test/1
{
  "name": "kürşat"
}

POST test/test/2
{
  "name": "KURSAT"
}

Query:

GET test/_search
{
  "query": {
    "match": {
      "name": "kursat"
    }
  }
}

Response:

 "hits": {
    "total": 2,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "2",
        "_score": 0.30685282,
        "_source": {
          "name": "KURSAT"
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "1",
        "_score": 0.30685282,
        "_source": {
          "name": "kürşat"
        }
      }
    ]
  }

Query:

GET test/_search
{
  "query": {
    "match": {
      "name": "kürşat"
    }
  }
}

Response:

 "hits": {
    "total": 2,
    "max_score": 0.4339554,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "1",
        "_score": 0.4339554,
        "_source": {
          "name": "kürşat"
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "2",
        "_score": 0.09001608,
        "_source": {
          "name": "KURSAT"
        }
      }
    ]
  }

The 'preserve_original' flag keeps the original token alongside its folded form. When a user types 'kürşat', documents containing that exact spelling are therefore ranked higher than documents containing 'kursat' (notice the difference in scores between the two query responses).

If you want the scores to be equal, set the flag to false.
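
To make the effect of the flag easier to see, here is a minimal Python sketch of what the analyzer does to each value. It is not Elasticsearch code: `fold_ascii`, `analyze` and `shared_terms` are hypothetical helpers, whitespace splitting stands in for the standard tokenizer, and NFKD decomposition only approximates the asciifolding filter (characters without a Unicode decomposition, such as the dotless ı, are left untouched here). Counting shared terms is a rough intuition for the score gap, not the actual scoring formula.

```python
import unicodedata


def fold_ascii(token):
    """Approximate ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text, preserve_original=True):
    """Mimic the analyzer chain: split -> lowercase -> ASCII folding."""
    tokens = []
    for word in text.split():
        lowered = word.lower()
        folded = fold_ascii(lowered)
        tokens.append(folded)
        # Keep the unfolded token as well, but only if folding changed it.
        if preserve_original and folded != lowered:
            tokens.append(lowered)
    return tokens


def shared_terms(query, doc, preserve_original=True):
    """Terms that the analyzed query and the analyzed document have in common."""
    q = set(analyze(query, preserve_original))
    d = set(analyze(doc, preserve_original))
    return sorted(q & d)


if __name__ == "__main__":
    print(analyze("kürşat"))                  # ['kursat', 'kürşat']
    print(analyze("KURSAT"))                  # ['kursat']
    print(shared_terms("kürşat", "kürşat"))   # ['kursat', 'kürşat']
    print(shared_terms("kürşat", "KURSAT"))   # ['kursat']
    print(analyze("kürşat", preserve_original=False))  # ['kursat']
```

With the flag on, the query 'kürşat' shares two terms with the document 'kürşat' but only one with 'KURSAT', which is consistent with the higher score for the exact match. With the flag off, both documents are reduced to the single term 'kursat'.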

Hope I got your problem right!
