In my index template, I have defined a custom analyzer that includes a stop words filter. See the following snippet:

  "settings": {
     "index.analysis.filter.german_stemmer.type": "stemmer",
     "index.analysis.filter.german_stop.type": "stop",
     "index.analysis.filter.german_stemmer.language": "light_german",
     "index.analysis.filter.german_keywords.keywords.0": "",
     "index.analysis.filter.german_stop.stopwords": "_german_",
     "index.analysis.filter.german_keywords.type": "keyword_marker",
     "index.analysis.analyzer.unigram.filter.0": "lowercase",
     "index.analysis.analyzer.unigram.filter.1": "german_stop",
     "index.analysis.analyzer.unigram.filter.2": "german_keywords",
     "index.analysis.analyzer.unigram.filter.3": "german_normalization",
     "index.analysis.analyzer.unigram.filter.4": "german_stemmer",
     "index.analysis.analyzer.unigram.tokenizer": "standard"
  }

I have also mapped the textBody field to the unigram analyzer. Next, I try to get the most frequent words by looking at the 100 terms with the highest document counts:

  "aggs":{
    "wordcounts":{
      "terms":{
        "field" : "textBody",
        "size" : 100
      }
    }
  }

Unfortunately, the results of this approach also include stop words. These stop words have a high document count but a term frequency of zero (checked in a follow-up query via a script field using tf()). Is there a way to remove the stop words from my aggregation result?

P.S.: The significant terms query also gives me stop words in my result set.

Examples of German stop words: "viel", "muss", "soll", "war", "weg", "den", ...

I am using elasticsearch-groovy:1.7.0, which is built on top of:

  • elasticsearch:1.7.0
  • lucene-core:4.10.4

update:

I figured out that some word forms are reduced to the words mentioned above. For example, "muss" appears in the text in the following word forms: "muss", "muesse", "muessen", "muß", "muße", "müsse" and "müsser". These words are all reduced to the stemmed form "muss". If I query for the word "muss", I get zero results, because the query term is removed by the stop words filter.

In some queries (e.g. a must_not filter) it is even possible to query "muss", and as a result I get the aggregated count of all word forms other than "muss".
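
To make the effect described in this update easier to follow, here is a toy Python model of the filter order (stop filter first, stemmer last). It is only a sketch: STOPWORDS and STEMS are hypothetical stand-ins, not Lucene's real stop list or the light_german stemmer, and the lookup table folds normalization and stemming into one step.

```python
# Toy model of the analyzer chain: lowercase -> stop -> normalize/stem.
# STOPWORDS and STEMS are hypothetical stand-ins, not Lucene's real data.
STOPWORDS = {"muss", "war", "den"}
STEMS = {"muessen": "muss", "muesse": "muss", "müsse": "muss", "muß": "muss"}


def analyze(text, stop_after_stemming=False):
    """Return the tokens that would end up in the index for `text`."""
    tokens = [t.lower() for t in text.split()]
    # The stop filter runs BEFORE stemming, so it only sees surface forms.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Stemming can turn a surviving token into a stop word.
    tokens = [STEMS.get(t, t) for t in tokens]
    if stop_after_stemming:
        # Optional second pass that also removes stemmed stop words.
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


indexed = analyze("Wir muessen gehen")  # "muessen" survives and becomes "muss"
query = analyze("muss")                 # the literal stop word is removed
cleaned = analyze("Wir muessen gehen", stop_after_stemming=True)
print(indexed, query, cleaned)
```

In this model the index contains "muss" (produced from "muessen"), while a query for the literal word "muss" analyzes to nothing, which matches the zero-result behavior described above.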

Which Elasticsearch and Lucene versions are you running? I suspect that older Lucene versions didn't include all of your stop words. - Mario Trucco
Elasticsearch 1.7.0, Lucene 4.10.4 - boraas

1 Answer

Lucene versions prior to 5.0 appear to use a hard-coded list of stop words (view source code). This list does not include "viel", "muss", "soll", "weg", or "den", but it does include "war".

However, you can specify a stopwords_path pointing to a file that contains all the stop words you need (see https://www.elastic.co/guide/en/elasticsearch/guide/current/using-stopwords.html).
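
As a rough sketch of what a custom stop list could look like, the Python snippet below builds index settings with a second stop filter that carries an explicit word list. The filter name german_stop_extra, the word list (taken from the examples in the question) and the choice to run it after the stemmer are my assumptions, not something I have tested on 1.7.0.

```python
import json

# Hypothetical extra stop words, taken from the examples in the question.
EXTRA_STOPWORDS = ["viel", "muss", "soll", "war", "weg", "den"]


def build_settings(extra_stopwords):
    """Build index settings with an additional, explicit stop word filter."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "german_stop": {"type": "stop", "stopwords": "_german_"},
                    # Assumed name; holds the words missing from the default list.
                    "german_stop_extra": {
                        "type": "stop",
                        "stopwords": list(extra_stopwords),
                    },
                    "german_keywords": {"type": "keyword_marker", "keywords": [""]},
                    "german_stemmer": {"type": "stemmer", "language": "light_german"},
                },
                "analyzer": {
                    "unigram": {
                        "tokenizer": "standard",
                        "filter": [
                            "lowercase",
                            "german_stop",
                            "german_keywords",
                            "german_normalization",
                            "german_stemmer",
                            # Runs last, so it also sees stemmed forms.
                            "german_stop_extra",
                        ],
                    }
                },
            }
        }
    }


settings = build_settings(EXTRA_STOPWORDS)
print(json.dumps(settings, indent=2))
```

To read the words from a file instead, the "stopwords" entry of the extra filter could be swapped for "stopwords_path" with a path to your own file. Either way, documents that are already indexed keep their old terms until they are reindexed.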

From version 5.0, the stop words are always read from a file, but I could not find the text of the default file. I found some files that were richer than the hard-coded list in version 4.10.4, but I don't know which one is used by default (here is one).