
From what I know about the Solr update process, it performs an "update" by deleting the existing document and adding the new one. This typically happens when a document is added that has the same id as an existing one, even if other fields or values differ.

My question is: does Solr have a clever internal mechanism to detect when exactly the same document is added once again, i.e. an added document has the same id and the same fields with the same values as an existing one?

The question relates to a use case in which I am trying to optimize an indexing process. A large collection of documents is added to a new index, but the process may be interrupted partway through. I wonder which wastes more time: working out, more or less manually, which documents have already been indexed successfully, or redoing the whole indexing process. In an ideal world, I would restart the indexing process and rely on Solr to check whether added documents are already in the index.

I couldn't come up with any logic that seems more efficient than just re-indexing any added document, but perhaps some Solr/Lucene developer could.

What kind of update handler do you use? - Mysterion
@Mysterion: I use SolrJ, which is why I use the default solr.DirectUpdateHandler2. This could be changed if helpful, I suppose. - Carsten

1 Answer


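If the ID already exists, Solr overwrites the whole document, even if all the fields are identical to what is already stored. It does not compare the old and new content first. The document's version (the `_version_` field) changes for that ID.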
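To make the consequence for a restarted indexing run concrete, here is a toy in-memory model of that overwrite-by-id behaviour. It is only a sketch and not Solr or SolrJ code: the `ToyIndex` class, its methods, and the counter used as a version are all hypothetical names made up for illustration. It shows that re-sending documents that were already indexed leaves the same set of documents in place, but each re-sent document is written again and gets a new version.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for an index keyed by unique id. NOT Solr code; all names are hypothetical.
class ToyIndex {
    // A stored document: its fields plus the version assigned when it was written.
    private static final class Entry {
        final Map<String, String> fields;
        final long version;

        Entry(Map<String, String> fields, long version) {
            this.fields = fields;
            this.version = version;
        }
    }

    private final Map<String, Entry> docs = new HashMap<>();
    private long clock = 0; // assumed: a simple increasing counter used as the version

    // Adds a document. An existing entry with the same id is replaced
    // unconditionally, without comparing field values.
    long add(String id, Map<String, String> fields) {
        long version = ++clock;
        docs.put(id, new Entry(new HashMap<>(fields), version));
        return version;
    }

    int size() { return docs.size(); }

    Map<String, String> fields(String id) { return docs.get(id).fields; }

    long version(String id) { return docs.get(id).version; }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        List<String> ids = List.of("a", "b", "c");

        // First run is "interrupted" after two of the three documents.
        for (String id : ids.subList(0, 2)) {
            index.add(id, Map.of("title", "doc " + id));
        }
        long before = index.version("a");

        // Restart from scratch: every document is sent again, including "a" and "b".
        for (String id : ids) {
            index.add(id, Map.of("title", "doc " + id));
        }

        System.out.println(index.size());                // 3: no duplicates were created
        System.out.println(index.version("a") > before); // true: "a" was rewritten
    }
}
```

In this model a full re-run is always safe with respect to the final contents, so the only cost of not tracking progress is the repeated write work for documents that were already there.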