How does Lucene store a document?

Question

Basically, how is each field inside a document stored in the inverted index? Does Lucene internally create a separate index for each field? Also, suppose a query targets a specific field: how does the search work internally?

I know how inverted indices work. But how do you store multiple fields in a single index, and how do you restrict a search to particular fields when requested?

If you want to see how Lucene stores indexed data, you can use the SimpleTextCodec. See this answer How to view Lucene Index for more details and some sample code. Basically, this generates human-readable index files (as opposed to the usual binary compressed formats). — andrewJames

andrewJames · Accepted Answer · 2021-05-11T19:22:19

As I mentioned in my comment, if you want to see how Lucene stores indexed data, you can use the SimpleTextCodec. See the answer to "How to view Lucene Index" for more details and some sample code. Basically, this generates human-readable index files (as opposed to the usual binary compressed formats).

Below is a sample of what you can expect to see when you use the SimpleTextCodec.

How do you store multiple fields in a single index?

To show a basic example, assume we have a Lucene text field defined as follows:

Field textField1 = new TextField("bodytext1", content, Field.Store.NO);

And assume we have two documents as follows (analyzed using the StandardAnalyzer):

Document 0: echo charlie delta echo
Document 1: bravo alfa charlie

This will give us a basic hierarchical index structure as follows:

field bodytext1
  term alfa
    doc 1
      freq 1
      pos 1
  term bravo
    doc 1
      freq 1
      pos 0
  term charlie
    doc 0
      freq 1
      pos 1
    doc 1
      freq 1
      pos 2
  term delta
    doc 0
      freq 1
      pos 2
  term echo
    doc 0
      freq 2
      pos 0
      pos 3

The general structure is therefore:

field [field 1]
  term [token value]
    doc [document ID]
      frequency
      position
field [field 2]
  term [token value]
    doc [document ID]
      frequency
      position

And so on, for as many fields as are indexed.

This structure supports basic field-based querying.

You can summarize it as:

field > term > doc > freq/pos

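So, "does Lucene internally create a separate index for each field?" Yes, in effect: terms and their postings are grouped by field within the same index, so a query on one field only consults that field's terms.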
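The field > term > doc > freq/pos layout can be modelled in a few lines of plain Java. This is only a toy sketch of the logical structure, not Lucene's actual implementation or on-disk format: the class ToyFieldIndex and its methods are made up for illustration, the "title" field is a hypothetical second field, and a lowercase whitespace split stands in for the StandardAnalyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the logical layout: field > term > doc > positions.
// Hypothetical class for illustration; not part of Lucene.
class ToyFieldIndex {
    // field name -> term -> doc ID -> positions (freq is the size of the list)
    private final Map<String, TreeMap<String, TreeMap<Integer, List<Integer>>>> fields =
            new TreeMap<>();

    // Assumption: lowercase + whitespace split stands in for a real analyzer.
    public void add(int docId, String field, String text) {
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            fields.computeIfAbsent(field, f -> new TreeMap<>())
                  .computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                  .computeIfAbsent(docId, d -> new ArrayList<>())
                  .add(pos);
        }
    }

    // A field-specific lookup only ever touches that field's term map.
    public Map<Integer, List<Integer>> postings(String field, String term) {
        Map<Integer, List<Integer>> empty = new TreeMap<>();
        TreeMap<String, TreeMap<Integer, List<Integer>>> terms = fields.get(field);
        if (terms == null) {
            return empty;
        }
        TreeMap<Integer, List<Integer>> docs = terms.get(term);
        if (docs == null) {
            return empty;
        }
        return docs;
    }

    public static void main(String[] args) {
        ToyFieldIndex index = new ToyFieldIndex();
        index.add(0, "bodytext1", "echo charlie delta echo");
        index.add(1, "bodytext1", "bravo alfa charlie");
        index.add(1, "title", "echo"); // hypothetical second field

        System.out.println(index.postings("bodytext1", "echo")); // {0=[0, 3]}
        System.out.println(index.postings("title", "echo"));     // {1=[0]}
    }
}
```

The same term ("echo") lives in two separate term maps here, one per field, which is why a lookup restricted to "title" never sees the matches from "bodytext1".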

Lucene can also store additional structures in its index files, depending on how you configure your Lucene fields, so this is not the only way data can be indexed.

For example, you can request "term vector" data to be indexed as well, in which case you will see an additional index structure:

doc 0
  numfields 1
  field 1
    name content2
    positions true
    offsets   true
    payloads  false
    numterms 3
    term charlie
      freq 1
      position 1
        startoffset 6
        endoffset 13
    term delta
      freq 1
      position 2
        startoffset 15
        endoffset 20
    term echo
      freq 2
      position 0
        startoffset 0
        endoffset 4
      position 3
        startoffset 23
        endoffset 27
doc 1
  ...

This structure starts with documents, not fields, and is therefore well suited to processing where a document has already been selected (e.g. the "top hit" document). With this, it is easy to locate the position of a matched word in a specific document field.
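A document-first layout of this kind can be sketched in plain Java as well. The following is a toy model, not Lucene's term vector API: the class ToyTermVectors, its nested Occurrence type and the method names are hypothetical, and a simple regex over word characters stands in for a real analyzer. The offsets it yields depend on the exact input string, so they are not expected to match the listing above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of a document-first layout: doc > field > term > occurrences.
// Hypothetical class for illustration; not Lucene's term vector API.
class ToyTermVectors {
    // One occurrence of a term: token position plus character offsets.
    static final class Occurrence {
        final int position;
        final int startOffset;
        final int endOffset;

        Occurrence(int position, int startOffset, int endOffset) {
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return "pos " + position + " [" + startOffset + "-" + endOffset + "]";
        }
    }

    // doc ID -> field name -> term -> occurrences
    private final Map<Integer, Map<String, Map<String, List<Occurrence>>>> docs =
            new TreeMap<>();

    // Assumption: runs of word characters, lowercased, stand in for analysis.
    public void add(int docId, String field, String text) {
        Matcher m = Pattern.compile("\\w+").matcher(text);
        int pos = 0;
        while (m.find()) {
            docs.computeIfAbsent(docId, d -> new TreeMap<>())
                .computeIfAbsent(field, f -> new TreeMap<>())
                .computeIfAbsent(m.group().toLowerCase(), t -> new ArrayList<>())
                .add(new Occurrence(pos++, m.start(), m.end()));
        }
    }

    // With a document already selected, go straight to its field and term.
    public List<Occurrence> occurrences(int docId, String field, String term) {
        List<Occurrence> none = new ArrayList<>();
        Map<String, Map<String, List<Occurrence>>> fields = docs.get(docId);
        if (fields == null) {
            return none;
        }
        Map<String, List<Occurrence>> terms = fields.get(field);
        if (terms == null) {
            return none;
        }
        List<Occurrence> hits = terms.get(term);
        return hits == null ? none : hits;
    }

    public static void main(String[] args) {
        ToyTermVectors tv = new ToyTermVectors();
        tv.add(0, "bodytext1", "echo charlie delta echo");
        // prints [pos 0 [0-4], pos 3 [19-23]]
        System.out.println(tv.occurrences(0, "bodytext1", "echo"));
    }
}
```

Because the lookup starts from the document ID, finding where a word occurs in one known document needs no scan of the postings for other documents, which is the property that makes this layout handy for tasks such as highlighting.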

This is far from a comprehensive list. But by using SimpleTextCodec, together with different field types, documents and analyzers, you can see for yourself exactly how Lucene indexes its data.
