1
votes

Elasticsearch newbie question. I loaded shakespeare.json into Elastic, and I'm trying to figure out how to do an aggregation analogous to select speaker, count(1) from line group by speaker. ("Line" is the type of document, and "speaker" is one of the properties.)

Now I have a query like this:

{
  "size": 0,
  "query": {
    "match": {
      "play_name": "HAMLET"
    }
  },
  "aggs": {
    "line_count": {
      "terms": {
        "field": "speaker.speaker_raw"
      }
    }
  }
}

The results look right, but the Elasticsearch docs state that document counts for the terms aggregation are approximate (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html). Is there some other magic to get exact counts within a bucket?

Separately, I already figured out that I had to pre-define a field on the index that holds an un-analyzed version of "speaker", so that I can aggregate on the original field values rather than on tokenized terms. (See Elasticsearch - Cardinality over Full Field Value)
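For reference, a minimal sketch of what such a mapping might look like, assuming the older string-based mapping syntax, an index named "shakespeare" (an assumption, not stated above), and a multi-field called "speaker_raw" to match the "speaker.speaker_raw" path used in the query. Newer Elasticsearch versions replace not_analyzed strings with the keyword type, so treat this as an illustration rather than the exact mapping used:

```json
{
  "mappings": {
    "line": {
      "properties": {
        "speaker": {
          "type": "string",
          "fields": {
            "speaker_raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
```

A body like this would be sent when creating the index (for example PUT /shakespeare), before loading the documents.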

2
Counts in ES are accurate and complete for terms aggregations, the only approximate values are (IIRC) for cardinality and percentiles aggregations. See: elastic.co/guide/en/elasticsearch/reference/current/… And: elastic.co/guide/en/elasticsearch/reference/current/… - Or Weinberger
The docs for the terms aggregation say its counts are approximate as well. Wondering if that is handled by size: 0? - wrschneider
Ha, never noticed that. If I'm reading correctly, the reason for the approximate count is due to shard bucketing being 'biased' regarding the 'top x' results. So I guess if you're using size:0 it should be accurate, what do you think? - Or Weinberger
I think you're right about that. If you make that an answer I can accept it. - wrschneider

2 Answers

3
votes

Setting size:0 is now deprecated because of the memory problems it can cause on a cluster with high-cardinality field values. You can only use a number between 1 and 2147483647.

Source: https://github.com/elastic/elasticsearch/issues/18838
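As a rough sketch of the alternative, the aggregation can be given an explicit size that is at least the number of distinct speakers, so that every shard returns all of its terms and nothing is cut off. The value 1000 is an assumption about this data set, not a recommendation. Adding show_term_doc_count_error asks Elasticsearch to report an error bound alongside the counts:

```json
{
  "size": 0,
  "query": {
    "match": {
      "play_name": "HAMLET"
    }
  },
  "aggs": {
    "line_count": {
      "terms": {
        "field": "speaker.speaker_raw",
        "size": 1000,
        "show_term_doc_count_error": true
      }
    }
  }
}
```

If the response shows doc_count_error_upper_bound as 0 and sum_other_doc_count as 0, no terms were dropped and the bucket counts can be taken as exact. Note that the top-level "size": 0 only suppresses the search hits; it is separate from the "size" inside the terms aggregation.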

1
votes

According to the documentation, counts in the terms aggregation are approximate because each shard returns only its own 'top x' terms, so the merged result can miss documents for a term on shards where that term did not make the local cut.

If you set "size": 0 on the terms aggregation itself, I'm pretty certain that Elasticsearch will return accurate results.