Redis, how does SCAN cursor "state management" work?

Question

Redis has a SCAN command that may be used to iterate keys matching a pattern etc.

You start by giving a cursor value of 0; each call returns a new cursor value which you pass into the next SCAN call. A value of 0 indicates iteration is finished. Supposedly no server or client state is needed (except for the cursor value)

I'm wondering how Redis implements the scanning algorithm-wise?

misterion misterion · Accepted Answer · 2015-01-23T06:13:11

You may find answer in redis dict.c source file. Then I will quote part of it.

Iterating works the following way:

Initially you call the function using a cursor (v) value of 0. 2)
The function performs one step of the iteration, and returns the
new cursor value you must use in the next call.
When the returned cursor is 0, the iteration is complete.

The function guarantees all elements present in the dictionary get returned between the start and end of the iteration. However it is possible some elements get returned multiple times. For every element returned, the callback argument 'fn' is called with 'privdata' as first argument and the dictionary entry'de' as second argument.

How it works

The iteration algorithm was designed by Pieter Noordhuis. The main idea is to increment a cursor starting from the higher order bits. That is, instead of incrementing the cursor normally, the bits of the cursor are reversed, then the cursor is incremented, and finally the bits are reversed again.

This strategy is needed because the hash table may be resized between iteration calls. dict.c hash tables are always power of two in size, and they use chaining, so the position of an element in a given table is given by computing the bitwise AND between Hash(key) and SIZE-1 (where SIZE-1 is always the mask that is equivalent to taking the rest of the division between the Hash of the key and SIZE).

For example if the current hash table size is 16, the mask is (in binary) 1111. The position of a key in the hash table will always be the last four bits of the hash output, and so forth.

What happens if the table changes in size?

If the hash table grows, elements can go anywhere in one multiple of the old bucket: for example let's say we already iterated with a 4 bit cursor 1100 (the mask is 1111 because hash table size = 16).

If the hash table will be resized to 64 elements, then the new mask will be 111111. The new buckets you obtain by substituting in ??1100 with either 0 or 1 can be targeted only by keys we already visited when scanning the bucket 1100 in the smaller hash table.

By iterating the higher bits first, because of the inverted counter, the cursor does not need to restart if the table size gets bigger. It will continue iterating using cursors without '1100' at the end, and also without any other combination of the final 4 bits already explored.

Similarly when the table size shrinks over time, for example going from 16 to 8, if a combination of the lower three bits (the mask for size 8 is 111) were already completely explored, it would not be visited again because we are sure we tried, for example, both 0111 and 1111 (all the variations of the higher bit) so we don't need to test it again.

Wait... You have TWO tables during rehashing!

Yes, this is true, but we always iterate the smaller table first, then we test all the expansions of the current cursor into the larger table. For example if the current cursor is 101 and we also have a larger table of size 16, we also test (0)101 and (1)101 inside the larger table. This reduces the problem back to having only one table, where the larger one, if it exists, is just an expansion of the smaller one.

Limitations

This iterator is completely stateless, and this is a huge advantage, including no additional memory used. The disadvantages resulting from this design are:

It is possible we return elements more than once. However this is usually easy to deal with in the application level.
The iterator must return multiple elements per call, as it needs to always return all the keys chained in a given bucket, and all the expansions, so we are sure we don't miss keys moving during rehashing.
The reverse cursor is somewhat hard to understand at first, but this comment is supposed to help.