I'm using Elasticsearch 2.4 and would like to get distinct counts for various entities in my data. I've experimented with a number of queries covering two ways of calculating distinct counts: one is a cardinality aggregation, and the other is a terms aggregation, where the distinct count is the number of buckets returned. With the former approach I've seen the counts come out inaccurate, but it is faster and relatively simple. My data is large and will grow over time, so I don't know how the cardinality aggregation will perform, or whether it will become more accurate or less accurate. I wanted advice from people who have faced this question before, and which approach they chose.
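For reference, these are sketches of the two request bodies I mean, using the 2.x aggregation DSL (the field name `user_id` is a placeholder for whatever entity field is being counted):

```json
{
  "size": 0,
  "aggs": {
    "distinct_users": {
      "cardinality": { "field": "user_id" }
    }
  }
}
```

versus the terms-aggregation approach, where `"size": 0` in 2.x means "return all buckets" and the client counts them:

```json
{
  "size": 0,
  "aggs": {
    "users": {
      "terms": { "field": "user_id", "size": 0 }
    }
  }
}
```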
2 Answers
The cardinality aggregation takes an additional parameter, precision_threshold. From the docs:
The precision_threshold option allows trading memory for accuracy, and defines a unique count below which counts are expected to be close to accurate. Above this value, counts might become a bit more fuzzy. The maximum supported value is 40000; thresholds above this number will have the same effect as a threshold of 40000. The default value is 3000.
- configurable precision, which decides how to trade memory for accuracy,
- excellent accuracy on low-cardinality sets,
- fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.
In short, the cardinality aggregation can give you exact counts up to a cardinality of 40000, after which it gives an approximate count. The higher the precision_threshold, the higher the memory cost and the higher the accuracy. For very high cardinalities, it can only give you an approximate count.
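As a sketch, you would raise the threshold directly in the request body (field name `user_id` is a placeholder):

```json
{
  "size": 0,
  "aggs": {
    "distinct_users": {
      "cardinality": {
        "field": "user_id",
        "precision_threshold": 40000
      }
    }
  }
}
```

Below 40000 distinct values this behaves like an exact count, at the cost of more memory per aggregation.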
To add to what Rahul said in the answer below: cardinality will indeed give you an approximate count, but if you set precision_threshold to its maximum value of 40000, it will give you accurate results up to 40000 distinct values. Above that the error rate increases, but importantly it never goes above 1%, even up to 10 million documents.
(The linked docs include a chart of relative error vs. actual cardinality for various precision thresholds.)
Source: https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-metrics-cardinality-aggregation.html
Also, looking at it from the user's perspective: if a count of 10 million documents, or for that matter even a million documents, is off by 1%, it will not make much of a difference and will likely go unnoticed. And when the user wants to look at the actual data, he will run a search anyway, which returns accurate results.