2 votes

From a data structure point of view, how does Lucene filter over a range of continuous values?

I understand that Lucene relies upon a compressed bit array data structure akin to CONCISE. Conceptually this bit array holds a 0 for every document that doesn't match a term and a 1 for every document that does match a term. But the cool/awesome part is that this array can be highly compressed and is very fast at boolean operations. For example, if you want to know which documents contain the terms "red" and "blue", then you grab the bit array corresponding to "red" and the bit array corresponding to "blue" and AND them together to get a bit array corresponding to matching documents.
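For concreteness, here is a minimal sketch of that intersection using java.util.BitSet (illustrative only, not Lucene's actual compressed implementation, though the principle is the same):

```java
import java.util.BitSet;

public class TermIntersection {
    public static void main(String[] args) {
        // Conceptual per-term bitmaps: bit i is set iff document i contains the term.
        BitSet red = new BitSet();
        red.set(0); red.set(2); red.set(5);

        BitSet blue = new BitSet();
        blue.set(2); blue.set(3); blue.set(5);

        // "red" AND "blue": intersect the two bitmaps.
        BitSet both = (BitSet) red.clone();
        both.and(blue);
        System.out.println(both); // prints {2, 5}
    }
}
```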

But Lucene also provides very fast lookup over ranges of continuous values, and if Lucene still uses this same compressed bit array approach, I don't understand how this happens efficiently in computation or memory. Here's my assumption; you tell me how close I am: Lucene discretizes the continuous field and creates a bit array for each of these (now discrete) values. Then when you want a range across values, you grab the bit arrays corresponding to this range and just OR them together. Since I suspect there would be TONS of arrays to OR together, maybe you do this at several levels of granularity so that you can OR together big chunks as much as possible.
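To make that guess concrete, here is a toy sketch (purely hypothetical, not Lucene internals) of per-value bitmaps for a discretized field, with a range query answered by OR-ing the disjoint bitmaps together:

```java
import java.util.BitSet;

public class RangeAsUnion {
    public static void main(String[] args) {
        // Hypothetical per-value bitmaps for a field discretized to values 0..3.
        BitSet[] valueBitmaps = new BitSet[4];
        for (int v = 0; v < valueBitmaps.length; v++) valueBitmaps[v] = new BitSet();
        valueBitmaps[1].set(0);
        valueBitmaps[2].set(3); valueBitmaps[2].set(4);
        valueBitmaps[3].set(6);

        // Range query [1, 3]: OR (union) of the disjoint per-value bitmaps.
        BitSet result = new BitSet();
        for (int v = 1; v <= 3; v++) result.or(valueBitmaps[v]);
        System.out.println(result); // prints {0, 3, 4, 6}
    }
}
```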

UPDATE: Oh! I just realized another alternative that would be faster. Since we know that the bit arrays for the discretized ranges mentioned above will not overlap, we can store the bit arrays sequentially. If we have start and end values for our range, then we can keep an index to the corresponding points in this array of bit arrays. At that point we just jump into the array of bit arrays and start scanning it as if we were scanning a single bit array. Am I close?
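And a toy version of that sequential-layout alternative (again hypothetical), here using concatenated sorted doc-id lists plus an offset index, a close cousin of the concatenated-bit-array layout described above:

```java
import java.util.Arrays;

public class SequentialRangeScan {
    public static void main(String[] args) {
        // Hypothetical layout: the (disjoint) doc lists for discretized
        // values 0..3 stored back to back, plus an offset index telling
        // where each value's list starts.
        int[] docs    = {0,  3, 4,  6};   // value 1 | value 2 | value 3
        int[] offsets = {0, 0, 1, 3, 4};  // offsets[v] = start of value v's list

        // Range query [1, 3]: one contiguous scan, no per-value merging needed.
        int from = offsets[1], to = offsets[3 + 1];
        System.out.println(Arrays.toString(
                Arrays.copyOfRange(docs, from, to))); // prints [0, 3, 4, 6]
    }
}
```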

Comments:

  • Hint: some of the sorting is done through the FieldCache, which is an uninverted view of the index. See also DocValues. – Doug T.
  • Any chance this answers your question? – mindas
  • @mindas - thanks, that certainly gets me closer. – JnBrymn

2 Answers

3 votes

Range queries (say 0 to 100) are a union of the postings lists of all the terms (1, 2, 3, 4, 5, ...) in the range. The problem is when the range has to visit many terms, because that means processing many short term lists.

It would be better to process only a few long lists (which is what Lucene is optimized for). So when you use a numeric field and index a number (like 4), we actually index it redundantly several times, adding some "fake terms" at lower precision. This allows us to instead process a range like 0 to 100 by processing, say, 7 terms instead of 100: "0-63", "64-95", 96, 97, 98, 99, 100. In this example "0-63" and "64-95" are the additional redundant terms that represent ranges of values.
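In the classic trie-based numeric API (the Lucene 4.x era, before BKD-tree points replaced it), this redundant indexing and range rewriting is roughly what IntField and NumericRangeQuery did; a minimal sketch, assuming that older API:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

public class NumericRangeSketch {
    public static void main(String[] args) {
        // Index side: IntField writes the value plus the redundant
        // lower-precision prefix terms described above.
        Document doc = new Document();
        doc.add(new IntField("price", 4, Field.Store.NO));

        // Query side: the range is rewritten into a handful of prefix
        // terms instead of one term per value in [0, 100].
        Query q = NumericRangeQuery.newIntRange("price", 0, 100, true, true);
        System.out.println(q); // e.g. price:[0 TO 100]
    }
}
```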

2 votes

I got a chance to study up on this in depth. The tl;dr, if you don't want to read the links at the bottom:

  • At index time, a number such as 1234 gets converted into multiple tokens: the original number [1234] and several tokens that represent less precise versions of it: [123x], [12xx], and [1xxx]. (This is somewhat of a simplification for ease of communicating the idea; see the sketch after this list.)
  • At query time you take advantage of the fact that the less precise tokens let you search over ranges of values. So Lucene doesn't do the naive thing - sweeping through the term dictionary of full-precision terms, pulling out all matching numbers, and doing a Boolean OR search over all of them. Instead Lucene uses the terms that cover the largest ranges possible and ORs together this much smaller set. For example, to search for all numbers in the range 1776 to 2015, Lucene would OR together these tokens: [1776], [1777], [1778], [1779], [178x], [179x], [18xx], [19xx], [200x], [2010], [2011], [2012], [2013], [2014], [2015].
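Here is a small sketch of the simplified decimal version of that index-time token generation (the real implementation worked on binary prefixes controlled by a precisionStep parameter, but the idea is the same):

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixTerms {
    // Toy decimal version of the "fake term" scheme described above:
    // 1234 -> [1234, 123x, 12xx, 1xxx].
    static List<String> tokens(int value) {
        String s = Integer.toString(value);
        List<String> out = new ArrayList<>();
        out.add(s); // full-precision term
        for (int keep = s.length() - 1; keep >= 1; keep--) {
            StringBuilder sb = new StringBuilder(s.substring(0, keep));
            for (int i = keep; i < s.length(); i++) sb.append('x');
            out.add(sb.toString()); // lower-precision "fake term"
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens(1234)); // prints [1234, 123x, 12xx, 1xxx]
    }
}
```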

Here's a nice walkthrough: