I'm trying to disable the lowercase filter so that I can do case-sensitive matching on text fields. Following the index and analyzer docs, I created the following index with a custom analyzer that omits the lowercase filter:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "asciifolding"
          ]
        }
      }
    }
  }
}
I then enable fielddata on the field so I can inspect the indexed tokens afterward:
PUT my_index/_mapping/_doc
{
  "properties": {
    "my_field": {
      "type": "text",
      "fielddata": true
    }
  }
}
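To double-check which analyzer (if any) ended up attached to my_field, the mapping can be read back. This is only a sketch of the check using the standard get-mapping endpoint; I'm not showing its output here.

```
GET /my_index/_mapping
```

If my_field shows no "analyzer" entry there, my understanding is that the field falls back to the index default analyzer rather than to my_custom_analyzer.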
I test the custom analyzer to make sure that it doesn't lowercase:
POST /my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà Vu</b>?"
}
which gets the following response
{
  "tokens": [
    {
      "token": "Is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "Vu",
      "start_offset": 16,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
Great, nothing is getting lowercased, just as I wanted. So now I index the same text to see what happens:
POST /my_index/_doc
{
  "my_field": "Is this <b>déjà Vu</b>?"
}
and then try to query for it with a case-sensitive regex:
POST /my_index/_search
{
  "query": {
    "regexp": {
      "my_field": "Is.*"
    }
  },
  "docvalue_fields": [
    "my_field"
  ]
}
and get no hits. If I lowercase the regex instead:
POST /my_index/_search
{
  "query": {
    "regexp": {
      "my_field": "is.*"
    }
  },
  "docvalue_fields": [
    "my_field"
  ]
}
which returns
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "6d6PP20BXDCQSINU0RC_",
        "_score": 1,
        "_source": {
          "my_field": "Is this <b>déjà Vu</b>?"
        },
        "fields": {
          "my_field": [
            "b",
            "déjà",
            "is",
            "this",
            "vu"
          ]
        }
      }
    ]
  }
}
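So it seems that the text is still being lowercased somewhere, since only the lowercase regex matches and the values returned by docvalue_fields all come back lowercased. The returned terms also include "b" and an unfolded "déjà", so html_strip and asciifolding do not appear to have been applied to the indexed document either. What am I doing wrong here?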
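One more check that might narrow this down: as far as I know, the _analyze API also accepts a "field" parameter, which runs the text through whatever analyzer that field is actually mapped to instead of a named analyzer. A sketch, assuming the index and field names above:

```
POST /my_index/_analyze
{
  "field": "my_field",
  "text": "Is this <b>déjà Vu</b>?"
}
```

Comparing these tokens with the my_custom_analyzer output above should show whether the field is really using the custom analyzer.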