What does Lucene's in-memory representation of the index look like?

Question

What does the in-memory representation (as opposed to the file format) of Lucene's index look like? Is the whole reverse index loaded into memory, e.g. as an array of posting lists (where each posting list contains document IDs, term frequencies in the document, and positions)? Something like

class Posting {
  private int docID;
  private int termFreq;
  private int[] termPositions;
}

class PostingList {
  private Posting[] postings;
}

public class SomeClassThatHoldsTheIndexInMemory {
  private PostingList[] index;  // Indexed by some internal term ID?
}

I understand that everything that makes up the index (including auxiliary information about terms) might not be held in memory, but surely something is?

Which classes define the in-memory representation of the index? If the index looks something like the above, how does Lucene go from a term (a string) to a term ID (an int)?

Martín Schonaker · Accepted Answer · 2013-03-09T18:04:00

Lucene's in-memory representation is defined through the RAMDirectory class, which is basically a HashMap from String keys (file names) to RAMFile values. A RAMFile is, in turn, a list of byte buffers representing the bytes of a file. This is the same information that you would store in an FSDirectory.
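To make that shape concrete, here is a minimal sketch in plain Java. It is not Lucene's code: SketchRamDirectory, SketchRamFile, the tiny buffer size and the method names are all hypothetical stand-ins, chosen only to show the "map of file name to list of byte buffers" idea.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a RAMFile: the bytes of one file, held as a list of fixed-size buffers.
class SketchRamFile {
    static final int BUFFER_SIZE = 8; // deliberately tiny so a short file spans several buffers
    final List<byte[]> buffers = new ArrayList<>();
    long length = 0;

    // Appends bytes, allocating a new buffer whenever the current one is full.
    void write(byte[] data) {
        for (byte b : data) {
            int offset = (int) (length % BUFFER_SIZE);
            if (offset == 0) {
                buffers.add(new byte[BUFFER_SIZE]);
            }
            buffers.get(buffers.size() - 1)[offset] = b;
            length++;
        }
    }

    // Reassembles the file contents from its buffers.
    byte[] readAll() {
        byte[] out = new byte[(int) length];
        for (int i = 0; i < out.length; i++) {
            out[i] = buffers.get(i / BUFFER_SIZE)[i % BUFFER_SIZE];
        }
        return out;
    }
}

// Hypothetical stand-in for a RAMDirectory: a map from file name to in-memory file.
class SketchRamDirectory {
    private final Map<String, SketchRamFile> files = new HashMap<>();

    SketchRamFile createOutput(String name) {
        SketchRamFile file = new SketchRamFile();
        files.put(name, file);
        return file;
    }

    SketchRamFile openInput(String name) {
        SketchRamFile file = files.get(name);
        if (file == null) {
            throw new IllegalArgumentException("no such file: " + name);
        }
        return file;
    }

    boolean fileExists(String name) {
        return files.containsKey(name);
    }
}
```

The point of the sketch is that nothing in it knows about terms or postings: the directory only holds named byte streams, and the index structures are whatever is encoded inside those bytes.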

Lucene stores an inverted index, organized as a set of incremental (possibly unmerged) segments. Each segment belongs to an "index commit", and each segment is more or less an inverted index in its own right. You can even find segments that hold the inverted index for only one document.
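As a rough illustration of "each segment is its own inverted index", the sketch below models a segment as a sorted map from term to a list of segment-local document IDs, and an index as a list of such segments. The class names (SketchSegment, SketchSegmentedIndex) and the running docBase offset are assumptions made for this example; Lucene's real segments are encoded in files and read through its reader classes, not stored as Java collections like this.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical segment: a small, self-contained inverted index with its own local doc IDs.
class SketchSegment {
    final Map<String, List<Integer>> postings = new TreeMap<>();
    int docCount = 0;

    // Adds one document (given as its terms) and returns its segment-local doc ID.
    int addDocument(String... terms) {
        int localDocId = docCount++;
        for (String term : terms) {
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Avoid listing the same document twice when a term repeats.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != localDocId) {
                docs.add(localDocId);
            }
        }
        return localDocId;
    }
}

// Hypothetical index: an ordered list of segments searched one after another.
class SketchSegmentedIndex {
    final List<SketchSegment> segments = new ArrayList<>();

    // Looks the term up in every segment and shifts local IDs by the number of docs in earlier segments.
    List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        int docBase = 0;
        for (SketchSegment segment : segments) {
            List<Integer> docs = segment.postings.getOrDefault(term, Collections.<Integer>emptyList());
            for (int localDocId : docs) {
                hits.add(docBase + localDocId);
            }
            docBase += segment.docCount;
        }
        return hits;
    }
}
```

Note that the lookup is keyed directly by the term string in each segment; no integer term ID appears anywhere in this sketch, and a segment holding a single document is just a segment whose docCount is 1.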

The original structure of a "posting" or Document is lost as soon as you add it to an index. Moreover, you can't iterate over the whole collection of documents (as far as I know). Still, nothing prevents you from keeping your postings/documents in a secondary structure, storing a serialized version of them in the index, or storing their object properties one by one as StoredFields; you can also define your own "iterable" document IDs in a field.

DirectoryReader and SegmentReader are the classes that deal with the internal structures of the index.

In the time I have used Lucene, I have never seen anything like a "term ID". Document IDs, however, are a common concept.
