2 votes

From a data structure point of view, how does Lucene filter over a range of continuous values?

I understand that Lucene relies upon a compressed bit array data structure akin to CONCISE. Conceptually this bit array holds a 0 for every document that doesn't match a term and a 1 for every document that does match a term. But the cool/awesome part is that this array can be highly compressed and is very fast at boolean operations. For example, if you want to know which documents contain the terms "red" and "blue", then you grab the bit array corresponding to "red" and the bit array corresponding to "blue" and AND them together to get a bit array corresponding to matching documents.
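For concreteness, here is a minimal sketch of that intersection using java.util.BitSet (illustrative only, not Lucene's actual compressed implementation, though the principle is the same):

```java
import java.util.BitSet;

public class TermIntersection {
    public static void main(String[] args) {
        // Conceptual per-term bitmaps: bit i is set iff document i contains the term.
        BitSet red = new BitSet();
        red.set(0); red.set(2); red.set(5);

        BitSet blue = new BitSet();
        blue.set(2); blue.set(3); blue.set(5);

        // "red" AND "blue": intersect the two bitmaps.
        BitSet both = (BitSet) red.clone();
        both.and(blue);
        System.out.println(both); // prints {2, 5}
    }
}
```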

But Lucene also provides very fast lookup over ranges of continuous values, and if Lucene still uses this same compressed bit array approach, I don't understand how this happens efficiently in computation or memory. Here's my assumption; you tell me how close I am: Lucene discretizes the continuous field and creates a bit array for each of these (now discrete) values. Then when you want a range across values, you grab the bit arrays corresponding to this range and just OR them together. Since I suspect there would be TONS of arrays to OR together, maybe you do this at several levels of granularity so that you can OR together big chunks as much as possible.
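To make that guess concrete, here is a toy sketch (purely hypothetical, not Lucene internals) of per-value bitmaps for a discretized field, with a range query answered by OR-ing the disjoint bitmaps together:

```java
import java.util.BitSet;

public class RangeAsUnion {
    public static void main(String[] args) {
        // Hypothetical per-value bitmaps for a field discretized to values 0..3.
        BitSet[] valueBitmaps = new BitSet[4];
        for (int v = 0; v < valueBitmaps.length; v++) valueBitmaps[v] = new BitSet();
        valueBitmaps[1].set(0);
        valueBitmaps[2].set(3); valueBitmaps[2].set(4);
        valueBitmaps[3].set(6);

        // Range query [1, 3]: OR (union) of the disjoint per-value bitmaps.
        BitSet result = new BitSet();
        for (int v = 1; v <= 3; v++) result.or(valueBitmaps[v]);
        System.out.println(result); // prints {0, 3, 4, 6}
    }
}
```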

UPDATE: Oh! I just realized another alternative that would be faster. Since we know that the bit arrays for the discretized ranges mentioned above will not overlap, we can store the bit arrays sequentially. If we have start and end values for our range, then we can keep an index to the corresponding points in this array of bit arrays. At that point we just jump into the array of bit arrays and start scanning it as if we were scanning a single bit array. Am I close?
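And a toy version of that sequential-layout alternative (again hypothetical), here using concatenated sorted doc-id lists plus an offset index, a close cousin of the concatenated-bit-array layout described above:

```java
import java.util.Arrays;

public class SequentialRangeScan {
    public static void main(String[] args) {
        // Hypothetical layout: the (disjoint) doc lists for discretized
        // values 0..3 stored back to back, plus an offset index telling
        // where each value's list starts.
        int[] docs    = {0,  3, 4,  6};   // value 1 | value 2 | value 3
        int[] offsets = {0, 0, 1, 3, 4};  // offsets[v] = start of value v's list

        // Range query [1, 3]: one contiguous scan, no per-value merging needed.
        int from = offsets[1], to = offsets[3 + 1];
        System.out.println(Arrays.toString(
                Arrays.copyOfRange(docs, from, to))); // prints [0, 3, 4, 6]
    }
}
```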

Comments:

  • Hint: some of the sorting is done through the FieldCache, which is an uninverted view of the index. See also DocValues. – Doug T.
  • Any chance this answers your question? – mindas
  • @mindas - thanks, that certainly gets me closer. – JnBrymn

2 Answers

3 votes

Range queries (say 0 to 100) are a union of the postings lists of all the terms (1, 2, 3, 4, 5, ...) in the range. The problem is when the range has to visit many terms, because that means processing many short term lists.

It would be better to process only a few long lists (which is what Lucene is optimized for). So when you use a numeric field and index a number (like 4), we actually index it redundantly several times, adding some "fake terms" at lower precision. This allows us to instead process a range like 0 to 100 by processing, say, 7 terms instead of 100: "0-63", "64-95", 96, 97, 98, 99, 100. In this example "0-63" and "64-95" are the additional redundant terms that represent ranges of values.
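In the classic trie-based numeric API (the Lucene 4.x era, before BKD-tree points replaced it), this redundant indexing and range rewriting is roughly what IntField and NumericRangeQuery did; a minimal sketch, assuming that older API:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

public class NumericRangeSketch {
    public static void main(String[] args) {
        // Index side: IntField writes the value plus the redundant
        // lower-precision prefix terms described above.
        Document doc = new Document();
        doc.add(new IntField("price", 4, Field.Store.NO));

        // Query side: the range is rewritten into a handful of prefix
        // terms instead of one term per value in [0, 100].
        Query q = NumericRangeQuery.newIntRange("price", 0, 100, true, true);
        System.out.println(q); // e.g. price:[0 TO 100]
    }
}
```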

2 votes

I got a chance to study up on this in depth. The tl;dr, if you don't want to read the links at the bottom:

  • At index time, a number such as 1234 gets converted into multiple tokens: the original number [1234] and several tokens that represent less precise versions of it: [123x], [12xx], and [1xxx]. (This is somewhat of a simplification for ease of communicating the idea; see the sketch after this list.)
  • At query time you take advantage of the fact that the less precise tokens let you search over ranges of values. So Lucene doesn't do the naive thing - sweeping through the term dictionary of full-precision terms, pulling out all matching numbers, and doing a Boolean OR search over all of them. Instead Lucene uses the terms that cover the largest ranges possible and ORs together this much smaller set. For example, to search for all numbers in the range 1776 to 2015, Lucene would OR together these tokens: [1776], [1777], [1778], [1779], [178x], [179x], [18xx], [19xx], [200x], [2010], [2011], [2012], [2013], [2014], [2015].
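Here is a small sketch of the simplified decimal version of that index-time token generation (the real implementation worked on binary prefixes controlled by a precisionStep parameter, but the idea is the same):

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixTerms {
    // Toy decimal version of the "fake term" scheme described above:
    // 1234 -> [1234, 123x, 12xx, 1xxx].
    static List<String> tokens(int value) {
        String s = Integer.toString(value);
        List<String> out = new ArrayList<>();
        out.add(s); // full-precision term
        for (int keep = s.length() - 1; keep >= 1; keep--) {
            StringBuilder sb = new StringBuilder(s.substring(0, keep));
            for (int i = keep; i < s.length(); i++) sb.append('x');
            out.add(sb.toString()); // lower-precision "fake term"
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens(1234)); // prints [1234, 123x, 12xx, 1xxx]
    }
}
```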

Here's a nice walkthrough: