How does Lucene organize and walk through the inverted index?

Question

In SQL an index is typically some kind of balanced tree (ordered nodes that point to the real table row to make searching possible in O(log n)). Walking through such a tree tree actually is the searching process.

Now Lucene uses an inverted index with term frequencies: It stores for each term how often it occurs in which documents. This is easy to understand. But this doesn't explain how a search on such an index is actually performed.

The search string is analyzed and split the same way into terms, of course, and then "the index is searched" for these terms to find the documents containing them – but how? Is the Lucene index itself also ordered and tree-like organized in some way for O(log n) to be possible? Or is walking the Lucene index on search-time actually linear, so O(n)?

Daniel Schneiter Daniel Schneiter · Accepted Answer · 2016-01-15T22:01:07

There is no simple answer to this. First, because the internal format got improved from release to release and second, because with the advent of Lucene 4 configurable codecs were introduced which serve as an abstraction between the logical format and the actual physical format.

An index is made up of shards and replicas, each of them being a Lucene index itself. A Lucene index is then made of of multiple segments, whereas each segment is again a Lucene index. A segment is read-only and made up of multiple artefacts which can be held in the file system or in RAM.

What's in a Lucene index from Adrien Grand is an excellent presentation on the organisation of a Lucene index. This blog article from Michael McCandless and this blog article from Elastic are about codecs introduced with Lucene 4.

So querying a Lucene index actually results in querying multiple segments in parallel making use of a specific codec. A codec can represent a structure in file system or in RAM, typically optimized/compressed for a particular need. Internal format can be anything from a tree, a Hashmap, a Finite State Machine just to name a few. As soon as you make use of wildcard characters in the search your query ("?" or "*") this automatically results in a more or less deep traversal in your index.

How does Lucene organize and walk through the inverted index?

1 Answers