In my index template I have defined a custom analyzer that includes a stop words filter. See the following snippet:
"settings": {
  "index.analysis.filter.german_stemmer.type": "stemmer",
  "index.analysis.filter.german_stop.type": "stop",
  "index.analysis.filter.german_stemmer.language": "light_german",
  "index.analysis.filter.german_keywords.keywords.0": "",
  "index.analysis.filter.german_stop.stopwords": "_german_",
  "index.analysis.filter.german_keywords.type": "keyword_marker",
  "index.analysis.analyzer.unigram.filter.0": "lowercase",
  "index.analysis.analyzer.unigram.filter.1": "german_stop",
  "index.analysis.analyzer.unigram.filter.2": "german_keywords",
  "index.analysis.analyzer.unigram.filter.3": "german_normalization",
  "index.analysis.analyzer.unigram.filter.4": "german_stemmer",
  "index.analysis.analyzer.unigram.tokenizer": "standard"
}
I have also mapped the textBody field to the unigram analyzer. Next, I try to get the most frequent words by looking at the 100 terms with the highest document counts:
"aggs": {
  "wordcounts": {
    "terms": {
      "field": "textBody",
      "size": 100
    }
  }
}
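For reference, here is a minimal Python sketch of how the aggregation above could be wrapped in a complete search body and how the returned buckets could be read. The top-level "size": 0 (to suppress the hits), the helper name top_terms, and the sample response are assumptions for illustration only; the sample counts are made up and do not come from my index.

```python
import json

# The aggregation from the post, wrapped in a complete search body.
# Assumption: "size": 0 is added so that only aggregation results are returned.
request_body = {
    "size": 0,
    "aggs": {
        "wordcounts": {
            "terms": {"field": "textBody", "size": 100}
        }
    },
}


def top_terms(response):
    """Return (term, doc_count) pairs from a terms aggregation response.

    Note that doc_count is the number of documents containing the term,
    not the number of times the term occurs.
    """
    buckets = response["aggregations"]["wordcounts"]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


# Hypothetical response with made-up counts, shaped like a terms aggregation result.
sample_response = {
    "aggregations": {
        "wordcounts": {
            "buckets": [
                {"key": "muss", "doc_count": 3},
                {"key": "haus", "doc_count": 2},
            ]
        }
    }
}

print(json.dumps(request_body, indent=2))
print(top_terms(sample_response))
```

The point of the sketch is only that each bucket carries a document count per indexed term, which is why a term can rank highly here regardless of what a separate tf() lookup reports.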
Unfortunately, with this approach the results also include stop words. These stop words have a high document count but a term frequency of zero (checked in a follow-up query using a script field with tf()). Is there a way to remove the stop words from my aggregation result?
P.S.: The significant terms aggregation also returns stop words in its result set.
Examples of German stop words: "viel", "muss", "soll", "war", "weg", "den", ...
I am using elasticsearch-groovy:1.7.0, which is built on top of:
- elasticsearch:1.7.0
- lucene-core:4.10.4
Update:
I figured out that some word forms are reduced to the words mentioned above. For example, "muss" occurs in the text in the following word forms: "muss", "muesse", "muessen", "muß", "muße", "müsse" and "müsser". These forms are all reduced to the stem "muss". Since german_stop runs before german_normalization and german_stemmer in my analyzer, the forms that are not on the stop list presumably pass the stop filter and only become "muss" afterwards. If I query for the word "muss" itself, I get zero results, because it is removed by the stop words filter.
In some queries (e.g. a must_not filter) it is even possible to query for "muss"; the result is then the aggregated count of all word forms other than "muss".
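To illustrate what I think is happening, here is a toy Python model of the filter order in the unigram analyzer. It is not the real Lucene code: the stop list is a small subset of the examples above, and the normalization and stemming step is replaced by a hypothetical lookup table built from the word forms I observed. All function names are made up for this sketch.

```python
# Toy model of the "unigram" filter chain; NOT the real Lucene filters.

# Small subset of the German stop words mentioned in the post.
STOP_WORDS = {"muss", "soll", "war", "den"}

# Word forms that I observed being reduced to "muss" (stand-in for
# german_normalization + german_stemmer, which are far more general).
REDUCES_TO_MUSS = {"muss", "muesse", "muessen", "muß", "muße", "müsse", "müsser"}


def stop_filter(tokens):
    """Drop tokens that are on the stop list (exact match only)."""
    return [t for t in tokens if t not in STOP_WORDS]


def normalize_and_stem(tokens):
    """Map the observed word forms to the stem "muss"; keep other tokens."""
    return ["muss" if t in REDUCES_TO_MUSS else t for t in tokens]


def analyze(text):
    """Lowercase, split on whitespace, then stop filter BEFORE stemming."""
    tokens = text.lower().split()
    tokens = stop_filter(tokens)
    return normalize_and_stem(tokens)


# "müsse" is not on the stop list, so it survives and is then stemmed to "muss".
print(analyze("sie müsse warten"))

# The literal stop word is removed before stemming, so nothing is left.
print(analyze("muss"))
```

In this model the term "muss" ends up in the index through the other word forms, while a query for the literal word "muss" is analyzed to nothing, which matches the behaviour described above.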