I'm using a custom analyzer to remove a certain set of stopwords. I'm then making phrase match queries with text that includes some of those stopwords. I would expect the stopwords to get filtered out of the query; however, they are not, and any documents that do not contain the stopwords are excluded from the results.
Here's a simplified example of what I'm trying to do:
#!/bin/bash
export ELASTICSEARCH_ENDPOINT="http://localhost:9200"
# Create index, with a custom analyzer to filter out the word 'foo'
curl -XPUT "$ELASTICSEARCH_ENDPOINT/play" -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "fooAnalyzer": {
          "type": "custom",
          "tokenizer": "letter",
          "filter": [
            "fooFilter"
          ]
        }
      },
      "filter": {
        "fooFilter": {
          "type": "stop",
          "stopwords": [
            "foo"
          ]
        }
      }
    }
  },
  "mappings": {
    "myDocument": {
      "properties": {
        "myMessage": {
          "analyzer": "fooAnalyzer",
          "type": "string"
        }
      }
    }
  }
}'
# Add sample document
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"play","_type":"myDocument"}}
{"myMessage":"bar baz"}
'
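As a sanity check, the _analyze endpoint should confirm that the custom analyzer drops 'foo'. This uses the older query-string form of _analyze, to match the 'string'-era API used above; I'd expect only the 'bar' and 'baz' tokens in the output:
# Sanity check: run the query text through the custom analyzer.
# Expected: only 'bar' and 'baz' tokens, with no 'foo' token.
curl -XGET "$ELASTICSEARCH_ENDPOINT/play/_analyze?analyzer=fooAnalyzer&text=bar+foo+baz&pretty"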
If I perform a phrase match search against this index with a filtered stopword in the middle of the query text, I would expect it to match, since 'foo' should be filtered away by the analyzer.
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
"query": {
"match": {
"myMessage": {
"type": "phrase",
"query": "bar foo baz"
}
}
}
}
'
However, I get no results.
Is there a way to instruct Elasticsearch to tokenize and filter the query string before performing the search?
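One variation that might be worth trying is pinning the query-side analysis to the same custom analyzer via the match query's analyzer parameter. This is only a sketch, assuming that parameter applies to the phrase type as well:
curl -XPOST "$ELASTICSEARCH_ENDPOINT/play/_search?pretty" -d '
{
  "query": {
    "match": {
      "myMessage": {
        "type": "phrase",
        "analyzer": "fooAnalyzer",
        "query": "bar foo baz"
      }
    }
  }
}
'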
Edit 1: Now I'm even more confused. I was already seeing that phrase matching failed when my query contained stopwords in the middle of the query text. Now I'm also seeing that the phrase query fails when the document contains stopwords in the middle of the document text. Here's a minimal example, still using the mapping from above.
POST play/myDocument
{
  "myMessage": "fib foo bar" <---- remember that 'foo' is a stopword and is filtered out of analysis
}
GET play/_search
{
  "query": {
    "match": {
      "myMessage": {
        "type": "phrase",
        "query": "fib bar"
      }
    }
  }
}
This query does not match. I'm very surprised by this! I would expect the foo stopword to be filtered out and ignored.
For an example of why I'd expect this, see this query:
POST play/myDocument
{
  "myMessage": "fib 123 bar"
}

GET play/_search
{
  "query": {
    "match": {
      "myMessage": {
        "type": "phrase",
        "query": "fib bar"
      }
    }
  }
}
This matches, because '123' is dropped by my 'letter' tokenizer. It seems like phrase matching is ignoring the stopword filtering completely and acting as if those tokens were still in the analyzed field (even though they don't show up in the list of tokens from _analyze).
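For reference, this is the kind of _analyze check I mean; I'd expect only the 'fib' and 'bar' tokens back for the stopword document (again using the older query-string form of the API):
# Tokens produced for the document text "fib foo bar";
# 'foo' should not appear in the output.
curl -XGET "$ELASTICSEARCH_ENDPOINT/play/_analyze?analyzer=fooAnalyzer&text=fib+foo+bar&pretty"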
My current best idea for a workaround (sketched below):
- Call the _analyze endpoint against my document's text string using my custom analyzer. This returns the tokens from the original text with the pesky stopwords removed.
- Save a version of the text built from only those tokens into a "filtered" field on the document.
Later, at query time:
- Call the _analyze endpoint against my query string using my custom analyzer to get just the tokens.
- Make my phrase match query with the filtered token string against the document's new "filtered" field.
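Here's a rough sketch of that workaround in the same curl style. The 'myMessageFiltered' field name is made up for illustration, and the token-joining step would happen in application code:
# 1. Run the document text through the custom analyzer to get its tokens
#    (the application joins the returned tokens with spaces: "fib bar").
curl -XGET "$ELASTICSEARCH_ENDPOINT/play/_analyze?analyzer=fooAnalyzer&text=fib+foo+bar&pretty"

# 2. Index the document with an extra, pre-filtered field built from those tokens.
curl -XPOST "$ELASTICSEARCH_ENDPOINT/play/myDocument" -d '{
  "myMessage": "fib foo bar",
  "myMessageFiltered": "fib bar"
}'

# 3. At query time, analyze the query string the same way, then phrase-match
#    the joined tokens against the pre-filtered field.
curl -XPOST "$ELASTICSEARCH_ENDPOINT/play/_search?pretty" -d '
{
  "query": {
    "match": {
      "myMessageFiltered": {
        "type": "phrase",
        "query": "fib bar"
      }
    }
  }
}
'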