As I mentioned in my comment, if you want to see how Lucene stores indexed data, you can use the SimpleTextCodec. See this answer, How to view Lucene Index, for more details and some sample code. Basically, this generates human-readable index files (as opposed to the usual binary compressed formats).
Below is a sample of what you can expect to see when you use the SimpleTextCodec.
How do you store multiple fields in a single index?
To show a basic example, assume we have a Lucene text field defined as follows:
Field textField1 = new TextField("bodytext1", content, Field.Store.NO);
And assume we have two documents as follows (analyzed using the StandardAnalyzer):
Document 0: echo charlie delta echo
Document 1: bravo alfa charlie
This will give us a basic hierarchical index structure as follows:
field bodytext1
  term alfa
    doc 1
      freq 1
      pos 1
  term bravo
    doc 1
      freq 1
      pos 0
  term charlie
    doc 0
      freq 1
      pos 1
    doc 1
      freq 1
      pos 2
  term delta
    doc 0
      freq 1
      pos 2
  term echo
    doc 0
      freq 2
      pos 0
      pos 3
The general structure is therefore:
field [field 1]
  term [token value]
    doc [document ID]
      frequency
      position
field [field 2]
  term [token value]
    doc [document ID]
      frequency
      position
And so on, for as many fields as are indexed.
This structure supports basic field-based querying.
You can summarize it as:
field > term > doc > freq/pos
So, "does Lucene internally create a separate index for each field?" Yes, it does.
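To make the field > term > doc > freq/pos nesting concrete, here is a minimal plain-Java sketch of the same shape. It is only a toy model, not Lucene code: the class name ToyInvertedIndex and its methods are made up for illustration, and lowercasing plus whitespace splitting stands in for the StandardAnalyzer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of the postings layout; this is NOT a Lucene class.
public class ToyInvertedIndex {
    // field -> term -> docId -> positions (freq is the size of the position list)
    private final Map<String, Map<String, Map<Integer, List<Integer>>>> index = new TreeMap<>();

    // Assumption: lowercase + whitespace split is a stand-in for real analysis.
    public void add(int docId, String field, String text) {
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            index.computeIfAbsent(field, f -> new TreeMap<>())
                 .computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                 .computeIfAbsent(docId, d -> new ArrayList<>())
                 .add(pos);
        }
    }

    // Field-based lookup: walk field, then term, then document.
    public List<Integer> positions(String field, String term, int docId) {
        return index.getOrDefault(field, Collections.emptyMap())
                    .getOrDefault(term, Collections.emptyMap())
                    .getOrDefault(docId, Collections.emptyList());
    }

    // Renders the nested maps as an indented tree, one level per map.
    public String dump() {
        StringBuilder sb = new StringBuilder();
        index.forEach((field, terms) -> {
            sb.append("field ").append(field).append('\n');
            terms.forEach((term, docs) -> {
                sb.append("  term ").append(term).append('\n');
                docs.forEach((doc, positions) -> {
                    sb.append("    doc ").append(doc).append('\n');
                    sb.append("      freq ").append(positions.size()).append('\n');
                    for (int p : positions) {
                        sb.append("      pos ").append(p).append('\n');
                    }
                });
            });
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.add(0, "bodytext1", "echo charlie delta echo");
        idx.add(1, "bodytext1", "bravo alfa charlie");
        System.out.print(idx.dump());
    }
}
```

Running main with the two sample documents should print a tree in the same field/term/doc/freq/pos order as the listing above. Adding a second field name simply creates a second top-level entry, which is the sense in which each field gets its own index.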
Lucene can also store additional structures in its index files, depending on how you configure your Lucene fields, so this is not the only way data can be indexed.
For example, you can request that "term vector" data also be indexed, in which case you will see an additional index structure:
doc 0
  numfields 1
  field 1
    name content2
    positions true
    offsets true
    payloads false
    numterms 3
    term charlie
      freq 1
      position 1
        startoffset 6
        endoffset 13
    term delta
      freq 1
      position 2
        startoffset 15
        endoffset 20
    term echo
      freq 2
      position 0
        startoffset 0
        endoffset 4
      position 3
        startoffset 23
        endoffset 27
doc 1
...
This structure starts with documents, not fields - and is therefore well suited for processing which already has a document selected (e.g. the "top hit" document). With this, it is easy to locate the position of a matched word in a specific document field.
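The document-first layout can be modeled the same way. The following is a rough plain-Java sketch, again not Lucene code: ToyTermVectors and Occurrence are hypothetical names, tokens are simply runs of non-whitespace characters, and the offsets are computed from whatever string you pass in, so they need not match the sample output above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy model of a term-vector layout; this is NOT a Lucene class.
public class ToyTermVectors {
    /** One occurrence of a term: token position plus character offsets. */
    public static final class Occurrence {
        public final int position;
        public final int startOffset;
        public final int endOffset;

        Occurrence(int position, int startOffset, int endOffset) {
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Assumption: a token is any run of non-whitespace characters.
    private static final Pattern TOKEN = Pattern.compile("\\S+");

    // docId -> field -> term -> occurrences (the document comes first here)
    private final Map<Integer, Map<String, Map<String, List<Occurrence>>>> vectors = new TreeMap<>();

    public void add(int docId, String field, String text) {
        Matcher m = TOKEN.matcher(text);
        int position = 0;
        while (m.find()) {
            vectors.computeIfAbsent(docId, d -> new TreeMap<>())
                   .computeIfAbsent(field, f -> new TreeMap<>())
                   .computeIfAbsent(m.group().toLowerCase(), t -> new ArrayList<>())
                   .add(new Occurrence(position++, m.start(), m.end()));
        }
    }

    // Starting from an already selected document, find where a term occurs in a field.
    public List<Occurrence> occurrences(int docId, String field, String term) {
        return vectors.getOrDefault(docId, Collections.emptyMap())
                      .getOrDefault(field, Collections.emptyMap())
                      .getOrDefault(term, Collections.emptyList());
    }

    public static void main(String[] args) {
        ToyTermVectors tv = new ToyTermVectors();
        tv.add(0, "bodytext1", "echo charlie delta echo");
        for (Occurrence o : tv.occurrences(0, "bodytext1", "echo")) {
            System.out.println("position " + o.position
                    + " startoffset " + o.startOffset
                    + " endoffset " + o.endOffset);
        }
    }
}
```

Because the outermost key is the document ID, looking up a matched word in a known document is three direct map lookups and never touches any other document's data.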
This is far from a comprehensive list. But by using SimpleTextCodec, together with different field types, documents and analyzers, you can see for yourself exactly how Lucene indexes its data.