We have the following scenario:
- Elasticsearch is built on Lucene.
- Index baseline of 14 million documents (batch indexed)
- Each week, about 20 thousand documents are deleted and about 30 thousand documents are reindexed or updated. Indexing happens in batches of 2000 documents via the Bulk API.
We handle the deletions first, and the updates run afterwards. Note that we may delete a document that the updater indexes again a few minutes later.
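To make the workflow concrete, here is a minimal sketch in Python of how such delete and update batches could be assembled as Bulk API request bodies (newline-delimited JSON). It is not our production code: the index name `docs` and the helper names `chunks`, `delete_payload` and `index_payload` are hypothetical, and the sketch only builds the payload strings without sending anything to a cluster.

```python
import json
from typing import Dict, Iterable, Iterator, List

BATCH_SIZE = 2000  # batch size from the scenario above


def chunks(items: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def delete_payload(index: str, ids: Iterable[str]) -> str:
    """Build an NDJSON bulk body with one delete action per document ID."""
    lines = [json.dumps({"delete": {"_index": index, "_id": doc_id}}) for doc_id in ids]
    return "\n".join(lines) + "\n"  # a bulk body must end with a newline


def index_payload(index: str, docs: Dict[str, dict]) -> str:
    """Build an NDJSON bulk body: an action line followed by the document source."""
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


# Step 1: deletions, in batches. Step 2 (later): reindex, possibly the same ID.
to_delete = ["D122", "D123"]
delete_bodies = [delete_payload("docs", batch) for batch in chunks(to_delete)]
reindex_body = index_payload("docs", {"D123": {"title": "updated"}})
```

In this sketch D123 appears first in a delete body and later in an index body, which is exactly the case the question below is about.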
My question: if ES marks a document (ID:D123) as deleted in one segment (let's say A), and a document with the same ID (ID:D123) is then indexed into another segment (B), the document should be searchable. But what happens when a segment merge occurs?
Segment B will be merged into segment A, which contains the delete flag for the same document ID (ID:D123).
After the merge, does the document still have the delete flag? I know that when segments are merged, deleted documents are not carried over. But does it matter which way around the merge happens, segment A into B or B into A?
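To make my mental model concrete, here is a toy sketch in Python. It is not Lucene code. It assumes that a delete flag belongs to one specific copy of a document inside one segment (not to the ID as such), and that a merge writes a new segment from the live documents of all its inputs. `Segment` and `merge` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Segment:
    """Toy segment: a list of doc IDs plus the positions marked as deleted."""
    doc_ids: List[str] = field(default_factory=list)
    deleted: Set[int] = field(default_factory=set)  # positions in doc_ids, not IDs

    def live_ids(self) -> List[str]:
        """Return the IDs of all documents that are not marked as deleted."""
        return [d for pos, d in enumerate(self.doc_ids) if pos not in self.deleted]


def merge(*segments: Segment) -> Segment:
    """Write a new segment containing only the live documents of the inputs."""
    merged = Segment()
    for seg in segments:
        merged.doc_ids.extend(seg.live_ids())
    return merged


a = Segment(doc_ids=["D122", "D123"], deleted={1})  # old copy of D123 deleted in A
b = Segment(doc_ids=["D123"])                       # D123 reindexed into B

# Under these assumptions the order of the inputs does not change the result.
assert sorted(merge(a, b).live_ids()) == sorted(merge(b, a).live_ids())
```

If this model is right, the reindexed D123 should survive the merge in either direction, which is why the missing documents puzzle me.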
We lose some documents with this scenario and still cannot find out why.
As a short-term solution, I filter out the documents to be deleted after reindexing.
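A minimal sketch of one way such a filter could look, assuming it means dropping from the delete list every ID that also appears in the reindex batch. The function name `filter_deletes` is hypothetical.

```python
from typing import Iterable, List


def filter_deletes(delete_ids: Iterable[str], reindex_ids: Iterable[str]) -> List[str]:
    """Return the delete IDs that are not about to be reindexed, preserving order."""
    to_reindex = set(reindex_ids)
    return [doc_id for doc_id in delete_ids if doc_id not in to_reindex]


safe_deletes = filter_deletes(["D122", "D123", "D124"], ["D123"])
```

Here D123 is removed from the delete list because it is going to be indexed again anyway, so only D122 and D124 are actually deleted.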
I'd like to understand the whole process. It does not seem consistent at all!
Thanks