1
votes

I am planning to use Lucene to index a very large corpus of text documents. I already understand how an inverted index works.

Question: Does Lucene store the actual source documents in its index (in addition to the terms)? So if I search for a term and want all the documents that contain the term, do the documents come out of Lucene, or does Lucene just return pointers (e.g. the file path to the matched documents)?


1 Answer

2
votes

This is up to you. Lucene represents each document as a collection of fields, and for each field you can configure whether its value is stored in the index. When handling largish documents, you would typically store the title field but not the body field, and you'd add an identifier field (stored, but not indexed) that can be used to retrieve the actual document from wherever it is kept.
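To make the stored/indexed distinction concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene's API: `TinyIndex`, `addField`, `storedFields` and the path `/corpus/doc-0001.txt` are all made up for illustration. In Lucene itself the same choice is expressed per field, for example with `Field.Store.YES` or `Field.Store.NO` on a `TextField`, and with `StoredField` for a value that is stored but not searchable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class StoredFieldSketch {

    /** Toy index (hypothetical, not Lucene): a field can be indexed, stored, or both. */
    static class TinyIndex {
        // "field:term" -> ids of the documents containing that term in that field
        private final Map<String, Set<Integer>> postings = new HashMap<>();
        // document id -> the field values that were marked as stored
        private final List<Map<String, String>> stored = new ArrayList<>();

        int newDocument() {
            stored.add(new HashMap<>());
            return stored.size() - 1;
        }

        void addField(int docId, String name, String value, boolean indexed, boolean isStored) {
            if (indexed) {
                // Searchable: split into lowercase terms and record them in the postings.
                for (String term : value.toLowerCase().split("\\W+")) {
                    if (!term.isEmpty()) {
                        postings.computeIfAbsent(name + ":" + term, k -> new TreeSet<>()).add(docId);
                    }
                }
            }
            if (isStored) {
                // Retrievable: keep the original value so it can be returned with a hit.
                stored.get(docId).put(name, value);
            }
        }

        /** A search only ever yields document ids. */
        Set<Integer> search(String field, String term) {
            return postings.getOrDefault(field + ":" + term.toLowerCase(), Collections.emptySet());
        }

        /** Only fields that were marked as stored can be read back. */
        Map<String, String> storedFields(int docId) {
            return stored.get(docId);
        }
    }

    static TinyIndex buildExample() {
        TinyIndex index = new TinyIndex();
        int doc = index.newDocument();
        index.addField(doc, "title", "Lucene in Brief", true, true);   // indexed and stored
        index.addField(doc, "body", "An inverted index maps terms to documents", true, false); // indexed only
        index.addField(doc, "path", "/corpus/doc-0001.txt", false, true); // stored only (hypothetical path)
        return index;
    }

    public static void main(String[] args) {
        TinyIndex index = buildExample();
        for (int docId : index.search("body", "inverted")) {
            // The body text is not available here; the stored path points to the source file.
            System.out.println(index.storedFields(docId));
        }
    }
}
```

A hit on the body field gives back only the title and the path, so the application opens the file at that path to get the full text. Storing the body as well would make the index self-sufficient, at the cost of a larger index.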