There are two ways to do this: a bad one and an ugly one. The first uses regular expressions (pattern_replace character filters) to remove stop words and trim spaces. It has a lot of drawbacks:
- you have to handle whitespace tokenization (a regexp like \s+) and special-symbol removal (., ;) on your own
- no highlighting is supported, because the keyword tokenizer does not support it
- case sensitivity is also a problem
- normalizers (the keyword-field equivalent of analyzers) are an experimental feature: poor support and very limited functionality
Here is a step-by-step example:
curl -XPUT "http://localhost:9200/test" -H 'Content-Type: application/json' -d'
{
"settings": {
"analysis": {
"normalizer": {
"custom_normalizer": {
"type": "custom",
"char_filter": ["stopword_char_filter", "trim_char_filter"],
"filter": ["lowercase"]
}
},
"char_filter": {
"stopword_char_filter": {
"type": "pattern_replace",
"pattern": "( ?und ?| ?gmbh ?)",
"replacement": " "
},
"trim_char_filter": {
"type": "pattern_replace",
"pattern": "(\\s+)$",
"replacement": ""
}
}
}
},
"mappings": {
"file": {
"properties": {
"name": {
"type": "keyword",
"normalizer": "custom_normalizer"
}
}
}
}
}'
Now we can check how our normalizer works (note that specifying a normalizer in the _analyze request is only supported in ES 6.x):
curl -XPOST "http://localhost:9200/test/_analyze" -H 'Content-Type: application/json' -d'
{
"normalizer": "custom_normalizer",
"text": "hansel und gretel gmbh"
}'
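If the normalizer is wired up correctly, the response should contain a single lowercased token with the stop words stripped, roughly like this (other token attributes omitted):
{
  "tokens": [
    {
      "token": "hansel gretel",
      ...
    }
  ]
}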
Now we are ready to index our document:
curl -XPUT "http://localhost:9200/test/file/1" -H 'Content-Type: application/json' -d'
{
"name": "hansel und gretel gmbh"
}'
And the last step is search:
curl -XGET "http://localhost:9200/test/_search" -H 'Content-Type: application/json' -d'
{
"query": {
"match" : {
"name" : {
"query" : "hansel gretel"
}
}
}
}'
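Since the normalizer is also applied to the query string of a match query at search time, searching with the full lowercase value should return the document as well (keep in mind that the pattern_replace char filters run before the lowercase token filter, so an uppercase "Und" or "Gmbh" in the query would not be stripped, which is exactly the case-sensitivity problem listed above):
curl -XGET "http://localhost:9200/test/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "name": {
        "query": "hansel und gretel gmbh"
      }
    }
  }
}'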
Another approach is:
- create a standard text analyzer with a stop-words filter
- use the _analyze API to filter out all stop words and special symbols
- concatenate the resulting tokens manually
- send the concatenated term to ES as a keyword
Here is a step-by-step example (delete the previous test index first, or use a different index name):
curl -XPUT "http://localhost:9200/test" -H 'Content-Type: application/json' -d'
{
"settings": {
"analysis": {
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "custom_stopwords"]
}
}, "filter": {
"custom_stopwords": {
"type": "stop",
"stopwords": ["und", "gmbh"]
}
}
}
},
"mappings": {
"file": {
"properties": {
"name": {
"type": "text",
"analyzer": "custom_analyzer"
}
}
}
}
}'
Now we are ready to analyze our text:
curl -XPOST "http://localhost:9200/test/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "custom_analyzer",
  "text": "Hansel und Gretel Gmbh."
}'
with the following result (note the gap at position 1, where the stop word "und" was removed):
{
  "tokens": [
    {
      "token": "hansel",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "gretel",
      "start_offset": 11,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
The last step is token concatenation: "hansel" + "gretel" gives "hansel gretel". The only drawback is that this analysis and concatenation step has to be done manually, in your own client code.
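A minimal sketch of that client-side step, using curl and jq (the jq dependency and the test_keyword index used below are assumptions, not part of the original setup):
NORMALIZED=$(curl -s -XPOST "http://localhost:9200/test/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "custom_analyzer",
  "text": "Hansel und Gretel Gmbh."
}' | jq -r '[.tokens[].token] | join(" ")')
# NORMALIZED now holds "hansel gretel"; index it into a field mapped as keyword
# ("test_keyword" is a hypothetical index whose "name" field has "type": "keyword")
curl -XPUT "http://localhost:9200/test_keyword/file/1" -H 'Content-Type: application/json' -d"
{
  \"name\": \"$NORMALIZED\"
}"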