1
votes

I am able to index a document (Word, PDF) using Solr. Is it possible to get the original document back? I assume not, because Solr stores only an index, but please correct me if I am wrong.

If not, how is this typically resolved (I mean retrieving the original docs)? By storing them in separate storage?


1 Answer

2
votes

@Alec Your understanding is correct. You can't get the original documents back from Solr. The alternative is to store the original documents separately, generate a unique ID in your main data store, and attach that ID to the document you index in Solr so that search results can be linked back to the originals. Solr is designed for search speed and is not as transaction-friendly as an RDBMS. So in my projects I maintain a separate datastore as the authoritative source of all application data (not just docs).
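As a rough illustration of that strategy, here is a minimal sketch that uses a plain directory as the separate store. The function names `store_original` and `find_original` and the directory layout are my own assumptions for illustration, not part of Solr; in a real project the store might be a database or object storage instead.

```python
import shutil
import uuid
from pathlib import Path


def store_original(src_path, store_dir):
    """Copy the original file into store_dir under a new unique ID.

    Returns the ID, which should also be used as the document ID in Solr.
    (Hypothetical helper; the directory-based store is an assumption.)
    """
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    doc_id = uuid.uuid4().hex  # 32 hex characters
    src = Path(src_path)
    # Keep the extension so the file type is still recognisable later.
    shutil.copy2(src, store / (doc_id + src.suffix))
    return doc_id


def find_original(doc_id, store_dir):
    """Return the path of the stored original for doc_id, or None."""
    matches = sorted(Path(store_dir).glob(doc_id + "*"))
    return matches[0] if matches else None
```

When a search hit comes back from Solr, its ID is passed to `find_original` to locate the file to serve to the user.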

For some background on how Solr handles documents internally, I suggest looking at the example on the Solr Wiki: https://wiki.apache.org/solr/ExtractingRequestHandler

Later versions are documented here: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

The docs say that Solr's ExtractingRequestHandler uses Tika to let users upload binary files to Solr, which then extracts the text from them and indexes it.

This means that only the extracted text is actually stored in Solr. The raw binary content is of no use to Solr for search or indexing purposes, and is presumably discarded, although I haven't found a statement in the docs that says so explicitly.
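To tie the indexed text back to the separately stored original, the ID can be passed to the extracting handler as a literal field value. Below is a small sketch that only builds the request URL; the host, port, and core name `docs` are assumptions for illustration, and `extract_url` is a hypothetical helper. The `/update/extract` path and the `literal.<field>` parameter are the ones described in the Solr Cell documentation linked above.

```python
from urllib.parse import urlencode


def extract_url(base_url, core, doc_id, commit=True):
    """Build the URL for posting a file to Solr's extracting handler.

    literal.id sets the Solr document ID to the ID of the stored original.
    (Hypothetical helper; base_url and core are deployment-specific.)
    """
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return f"{base_url.rstrip('/')}/{core}/update/extract?{urlencode(params)}"


# Example with an assumed local Solr instance and a core named "docs".
print(extract_url("http://localhost:8983/solr", "docs", "abc123"))
```

The file itself would then be sent as the body of a POST request to that URL, with whatever HTTP client the application already uses.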