HyperLogLog is a probabilistic data structure. It counts the number of distinct elements in a list (or stream). But in contrast to the straightforward way of doing it (keeping a set and adding every element to it), it does this in an approximate way.
Before looking at how the HyperLogLog algorithm does this, one has to understand why you need it at all. The problem with the straightforward way is that it consumes O(distinct elements) of space. Why is there a big-O notation here instead of just the number of distinct elements? Because elements can be of different sizes: one element can be 1, another element "is this big string". So if you have a huge list (or a huge stream of elements) it will take a lot of memory.
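To make the memory cost concrete, here is a minimal sketch of the straightforward approach (the function name is mine, just for illustration):

```python
def exact_distinct_count(stream):
    seen = set()                 # every distinct element is stored in full
    for element in stream:
        seen.add(element)
    return len(seen)

print(exact_distinct_count(["a", "b", "a", "is this big string"]))  # prints 3
```

Memory here grows with both the number of distinct elements and their sizes, which is exactly what HyperLogLog avoids.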
Probabilistic Counting
How can one get a reasonable estimate of the number of unique elements? Assume that you have a string of length m which consists of {0, 1} with equal probability. What is the probability that it starts with 0, with 2 zeros, with k zeros? It is 1/2, 1/4 and 1/2^k. This means that if you have encountered a string that starts with k zeros, you have approximately looked through 2^k elements. So this is a good starting point: having a list of elements that are evenly distributed between 0 and 2^k - 1, you can track the maximum length of the zero prefix in the binary representation, and this will give you a reasonable estimate.
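Here is a tiny simulation of that idea (the bit width K = 32 and the helper name are my choice for the sketch):

```python
import random

K = 32  # bit width of the uniform random numbers in this toy example

def leading_zeros(value, k=K):
    # number of leading zero bits in the k-bit binary representation of value
    return k - value.bit_length()

n = 100_000
max_zeros = max(leading_zeros(random.getrandbits(K)) for _ in range(n))
print(2 ** max_zeros)  # a very rough estimate of n
```

Run it a few times and the estimate jumps around by factors of two or more, which is the variance problem discussed further below.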
The problem is that the assumption of having evenly distributed numbers from 0 to 2^k - 1 is too hard to achieve (the data we encounter is mostly not numbers, is almost never evenly distributed, and can take any values). But using a good hashing function you can assume that the output bits are evenly distributed, and most hashing functions have outputs between 0 and 2^k - 1 (SHA1 gives you values between 0 and 2^160 - 1). So what we have achieved so far is that we can estimate the number of unique elements up to a maximum cardinality of 2^k while storing only one number of size log(k) bits (the length of the zero prefix, which is at most k). The downside is that we have a huge variance in our estimate. A cool thing is that we have almost re-created the probabilistic counting paper by Flajolet and Martin from 1985 (its estimate is a little bit smarter, but still we are close).
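A sketch of this single-number estimator, using a hash to make arbitrary elements look uniformly distributed (truncating SHA1 to 64 bits is just a convenient choice for the sketch, not something the papers prescribe):

```python
import hashlib

K = 64  # we keep the top 64 bits of the SHA1 output

def hash_bits(element):
    digest = hashlib.sha1(str(element).encode()).digest()
    return int.from_bytes(digest[:8], "big")   # roughly uniform K-bit integer

def estimate_distinct(stream):
    max_zeros = 0
    for element in stream:
        zeros = K - hash_bits(element).bit_length()  # leading zeros of the hash
        max_zeros = max(max_zeros, zeros)            # the only state we keep
    return 2 ** max_zeros

print(estimate_distinct(str(i) for i in range(100_000)))  # rough, high variance
```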
LogLog
Before moving further, we have to understand why our first estimate is not that great. The reason is that one unlucky element whose hash has an unusually long prefix of zeros can spoil everything. One way to improve it is to use many hash functions, track the maximum for each of them, and in the end average the results. This is an excellent idea, which does improve the estimate, but the LogLog paper used a slightly different approach (probably because hashing is kind of expensive).
They used one hash but divided it into two parts. One part is called a bucket (the total number of buckets is 2^x) and the other is basically the same as our hash. It was hard for me to get what was going on, so I will give an example. Assume you have two elements and your hash function, which gives 10-bit values from 0 to 2^10 - 1, produced two values: 344 and 387. You decided to have 16 buckets, so the first 4 bits pick the bucket and the remaining 6 bits are used for the zero-prefix count:

0101 011000 -> bucket 5 will store 1 (one leading zero in the remaining bits)
0110 000011 -> bucket 6 will store 4 (four leading zeros in the remaining bits)
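A few lines to reproduce that split (this mirrors the example above rather than the paper's exact conventions):

```python
for h in (344, 387):
    bits = format(h, "010b")                   # 10-bit binary representation
    bucket = int(bits[:4], 2)                  # first 4 bits: bucket index (16 buckets)
    rest = bits[4:]                            # remaining 6 bits
    zeros = len(rest) - len(rest.lstrip("0"))  # leading zeros of the remainder
    print(h, bits, "-> bucket", bucket, "stores", zeros)
# 344 0101011000 -> bucket 5 stores 1
# 387 0110000011 -> bucket 6 stores 4
```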
By having more buckets you decrease the variance (you use slightly more space, but it is still tiny). Using their math skills, the authors were able to quantify the error, which is 1.3/sqrt(number of buckets).
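Putting the pieces together, a compact LogLog-style estimator might look like the sketch below. The split of a 64-bit SHA1 prefix into 10 bucket bits and 54 value bits is my choice, and 0.397 only approximates the bias-correction constant from the LogLog paper:

```python
import hashlib

NUM_BUCKET_BITS = 10
NUM_BUCKETS = 1 << NUM_BUCKET_BITS      # 1024 buckets
VALUE_BITS = 54                         # remaining bits of the 64-bit hash

def hash64(element):
    digest = hashlib.sha1(str(element).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def loglog_estimate(stream):
    registers = [0] * NUM_BUCKETS
    for element in stream:
        h = hash64(element)
        bucket = h >> VALUE_BITS                    # top bits pick the bucket
        rest = h & ((1 << VALUE_BITS) - 1)          # the rest acts like our hash above
        rank = VALUE_BITS - rest.bit_length() + 1   # position of the first 1-bit
        registers[bucket] = max(registers[bucket], rank)
    mean_rank = sum(registers) / NUM_BUCKETS
    # 0.397 roughly approximates the alpha constant from the LogLog paper
    return 0.397 * NUM_BUCKETS * 2 ** mean_rank

print(round(loglog_estimate(str(i) for i in range(100_000))))
```

With 1024 buckets the expected error is about 1.3/sqrt(1024), roughly 4%.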
HyperLogLog
HyperLogLog does not introduce any new ideas, but mostly uses a lot of math to improve the previous estimate. Researchers found that if you remove the 30% biggest numbers from the buckets you significantly improve the estimate. They also used a different algorithm for averaging the bucket values, the harmonic mean, which is far less sensitive to outliers. The paper is math-heavy.
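To illustrate the averaging change, only the last step of the LogLog sketch above needs to be replaced (this reuses hash64, NUM_BUCKETS and VALUE_BITS from that sketch; 0.72 only approximates HyperLogLog's alpha constant for about a thousand buckets, and the real algorithm adds further corrections for very small and very large cardinalities):

```python
def hyperloglog_estimate(stream):
    registers = [0] * NUM_BUCKETS
    for element in stream:
        h = hash64(element)
        bucket = h >> VALUE_BITS
        rest = h & ((1 << VALUE_BITS) - 1)
        rank = VALUE_BITS - rest.bit_length() + 1
        registers[bucket] = max(registers[bucket], rank)
    # Harmonic mean of 2^register: a few outlier registers barely move the sum,
    # which is what tames the variance compared to the arithmetic mean above.
    harmonic_mean = NUM_BUCKETS / sum(2 ** -r for r in registers)
    return 0.72 * NUM_BUCKETS * harmonic_mean
```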
And I want to finish with a more recent paper, which presents an improved version of the HyperLogLog algorithm (I haven't had time to fully understand it yet, but maybe later I will improve this answer).