I'm trying to disable the lowercase filter so that I can do case-sensitive matching on text fields. Following the index and analyzer docs, I create the index with a custom analyzer that leaves out the lowercase filter:

PUT /my_index

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "asciifolding"
          ]
        }
      }
    }
  }
}

I enable fielddata so I can inspect the tokenization afterward:

PUT my_index/_mapping/_doc

{
  "properties": {
    "my_field": { 
      "type":     "text",
      "fielddata": true
    }
  }
}

I test the custom analyzer to make sure it doesn't lowercase, as expected:

POST /my_index/_analyze

{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà Vu</b>?"
}

which gets the following response

{
  "tokens": [
    {
      "token": "Is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "Vu",
      "start_offset": 16,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}

Great, nothing is getting lowercased, just as I wanted. So now I index the same text to see what happens:

POST /my_index/_doc

{
  "my_field": "Is this <b>déjà Vu</b>?"
}

and try querying back for it

POST /my_index/_search

{
  "query": {
    "regexp": {
      "my_field": "Is.*"
    }
  },
  "docvalue_fields": [
    "my_field"
  ]
}

and get no hits. Now if I lowercase the regex:

POST /my_index/_search

{
  "query": {
    "regexp": {
      "my_field": "is.*"
    }
  },
  "docvalue_fields": [
    "my_field"
  ]
}

which returns

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "6d6PP20BXDCQSINU0RC_",
        "_score": 1,
        "_source": {
          "my_field": "Is this <b>déjà Vu</b>?"
        },
        "fields": {
          "my_field": [
            "b",
            "déjà",
            "is",
            "this",
            "vu"
          ]
        }
      }
    ]
  }
}

So it seems that something is still lowercasing the text: only the lowercase regex matches, and the docvalue fields all come back lowercased. What am I doing wrong here?

1 Answer

Good start so far!!!

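The only issue is that you're not applying your custom analyzer to your field. Calling _analyze with the analyzer name proves the analyzer itself works, but my_field has no "analyzer" setting, so it is indexed with the default standard analyzer, which lowercases. Your fielddata output shows this: "b" and "déjà" are in there, so neither html_strip nor asciifolding ran on that field. The regexp query is not analyzed and is matched against the indexed terms, which is why only the lowercase pattern hits. Change your mapping to this and it's going to get you further.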

PUT my_index/_mapping/_doc
{
  "properties": {
    "my_field": { 
      "type":     "text",
      "fielddata": true,
      "analyzer": "my_custom_analyzer"
    }
  }
}
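
One caveat, as far as I know: Elasticsearch does not let you change the analyzer of a field that already exists in the mapping, and documents that are already indexed are not re-analyzed. If the update above is rejected with a mapping conflict, a rough sketch of the workaround is to drop the index and recreate it with the settings and the mapping in a single request. This assumes the index only holds test data you can throw away, and it uses the 6.x-style mapping with the _doc type that your requests suggest; on 7.x and later the "_doc" level under "mappings" is dropped.

```
DELETE /my_index

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": {
          "type": "text",
          "fielddata": true,
          "analyzer": "my_custom_analyzer"
        }
      }
    }
  }
}

POST /my_index/_analyze
{
  "field": "my_field",
  "text": "Is this <b>déjà Vu</b>?"
}
```

The last request passes "field" instead of "analyzer", so it runs the text through whatever analyzer my_field is actually mapped to. With the mapping in place it should return the same tokens as your earlier test (Is, this, deja, Vu). After you index the document again, the "Is.*" regexp should match.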