1
votes

Does Elastic/Lucene really need to store all indexed data in a document? Couldn't you just pass data through it so that Lucene indexes the words into its hash table, and keep a single field for each document with the URL (or whatever pointer makes sense for you) that says where each document came from?

A quick example would be indexing Wikipedia.org. If I pass each webpage to Elastic/Lucene to index, why do I need to save each webpage's main text in a field if Lucene indexes it and has a corresponding URL field to return for searches?

We pay the cloud so much money to store so much redundant data. I'm just wondering: if Lucene is searching from its hash table and not the actual fields we save data into, why save that data if we don't want it?

Is there a way to index full text documents in Elastic without having to save all of the full text data from those documents?


1 Answer

1
votes

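There are a lot of options for the _source field. This is the field that actually stores the original document. You can disable it completely or choose which fields to keep in it. Keep in mind that disabling it has trade-offs: features that need the original document, such as the update and reindex APIs, stop working. More information can be found in the docs: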

https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html
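
As a rough sketch of what those two options could look like for the Wikipedia example, here is a small Python helper that builds the body of a create-index request. The field names url and body and the helper name build_index_body are made up for illustration; only the _source settings (enabled, includes) and the store mapping parameter come from Elasticsearch itself.

```python
import json


def build_index_body(keep_fields=None):
    """Build a create-index body that limits what _source stores.

    keep_fields=None   -> disable _source entirely
    keep_fields=[...]  -> keep only the listed fields in _source
    The field names "url" and "body" are hypothetical.
    """
    if keep_fields is None:
        source = {"enabled": False}
    else:
        source = {"includes": list(keep_fields)}

    return {
        "mappings": {
            "_source": source,
            "properties": {
                # store=True keeps the URL retrievable via stored_fields
                # even when _source is disabled
                "url": {"type": "keyword", "store": True},
                # indexed for full-text search; not stored on its own
                "body": {"type": "text"},
            },
        }
    }


# Option 1: no _source at all, only the stored url field can be returned
print(json.dumps(build_index_body(), indent=2))

# Option 2: _source keeps just the url; body is still searchable
print(json.dumps(build_index_body(["url"]), indent=2))
```

Either body could be sent with a PUT to the index name when creating the index. In both cases the page text is still indexed and searchable; it just is not kept around to be returned with the hits.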