
I'm new to Solr and I want to understand exactly how it indexes documents.

Let's say I have a 100 MB document (document1) full of text. The text is not structured, it's just raw text. I send that document to Solr in order to be indexed.

As far as I understand, Lucene will parse the document, extract all the words based on the default schema (let's assume we're using the default schema), and create an index that is basically a mapping from each word to the list of documents containing it, like so:

word1 -> [document1]

word2 -> [document1]

etc

Now, if I want to search for the word "word1", Solr will give me the entire 100 MB document that contains the word "word1", correct?

Please correct me if I'm wrong, I need to understand exactly how it works.


1 Answer


You described the indexing part reasonably well, at least at a high level. The reason you get the whole document back is that the field is a stored field in your Solr schema (stored is true by default).

This means that, apart from keeping a postings list such as

word1 -> [doc1, doc3]

word2 -> [doc2, doc3]

Solr/Lucene also stores the original content of the field so that it can return it to you. You can either turn this off explicitly by setting stored=false in your schema, or leave the content out of the response with the fl parameter, for example by requesting only fl=id (or something similar).
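To make the difference between the postings list and the stored content concrete, here is a minimal Python sketch. It is a toy model only, not how Lucene is actually implemented: the class ToyIndex, its methods, and the field name "content" are all made up for illustration, and the "analysis" step is just lowercasing and splitting on non-word characters.

```python
import re
from collections import defaultdict


class ToyIndex:
    """Toy model of an inverted index with an optional stored field.

    Hypothetical illustration only; Lucene's real data structures differ.
    """

    def __init__(self, stored=True):
        self.stored = stored                 # mimics stored=true / stored=false
        self.postings = defaultdict(set)     # term -> set of document ids
        self.stored_content = {}             # document id -> original text

    def add(self, doc_id, text):
        # Naive analysis: lowercase, then split on non-word characters.
        for term in re.findall(r"\w+", text.lower()):
            self.postings[term].add(doc_id)
        # The original text is kept only when the field is stored.
        if self.stored:
            self.stored_content[doc_id] = text

    def search(self, term, fl=("id", "content")):
        # Look the term up in the postings list, then build each result
        # from the requested fields (a rough analogue of Solr's fl).
        results = []
        for doc_id in sorted(self.postings.get(term.lower(), ())):
            doc = {}
            if "id" in fl:
                doc["id"] = doc_id
            if "content" in fl and doc_id in self.stored_content:
                doc["content"] = self.stored_content[doc_id]
            results.append(doc)
        return results


index = ToyIndex(stored=True)
index.add("doc1", "word1 and some more text")
index.add("doc3", "word1 word2")

full_hits = index.search("word1")             # ids plus the original text
id_only_hits = index.search("word1", fl=("id",))  # ids only
```

In this sketch the match is always found through the postings list. Whether the original text comes back depends only on whether it was stored and whether it was requested, which is the same distinction as stored=false versus fl=id above.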

If you would like to return only the parts of the document around the matched terms, you can use Solr's highlighting feature. Highlighting in Solr allows fragments of documents that match the user’s query to be included with the query response.
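As a rough sketch of what such requests could look like, the Python snippet below only builds the query URLs; it does not contact a server. The core name "mycore" and the field name "content" are assumptions, so substitute your own. The parameters q, fl, hl, hl.fl, hl.snippets and hl.fragsize are standard Solr query parameters, but the values chosen here are arbitrary. Note that highlighting generally needs the field to be stored, so it pairs with fl=id rather than with stored=false.

```python
from urllib.parse import urlencode

# Hypothetical core name on a default local Solr install; adjust as needed.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_query_url(term, field="content", highlight=False):
    """Build a select URL that returns only ids, optionally with highlights."""
    params = {
        "q": f"{field}:{term}",   # search the (assumed) text field
        "fl": "id",               # return only the id, not the 100 MB body
    }
    if highlight:
        params.update({
            "hl": "true",         # switch highlighting on
            "hl.fl": field,       # field to take fragments from
            "hl.snippets": "3",   # at most three fragments per document
            "hl.fragsize": "150", # approximate fragment size in characters
        })
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


ids_only_url = build_query_url("word1")
highlight_url = build_query_url("word1", highlight=True)
```

With the second URL the response should contain the matching ids plus a separate highlighting section with short fragments around "word1", instead of the full document text.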