It might be due to the refresh setting. By default, a newly indexed document is not immediately available for search; a refresh operation must occur first.
Maybe you are in the following scenario:
- Search user X
- No match => index user X
- Search user X
- Still no match because refresh has not occurred yet => Index user X again
- Refresh
By default the refresh operation occurs every second, so if two searches for the same user happen less than 1 second apart, you are likely to index the document twice.
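To make the scenario concrete, here is a toy in-memory model of it. It is not Elasticsearch: `ToyIndex` and `index_if_missing` are made-up names, and the only behaviour modelled is that indexed documents stay invisible to search until `refresh()` is called.

```python
class ToyIndex:
    """In-memory stand-in for an index with near-real-time search."""

    def __init__(self):
        self._buffer = []      # indexed, but not searchable yet
        self._searchable = []  # visible to search after a refresh

    def index(self, doc):
        self._buffer.append(doc)

    def refresh(self):
        # Make everything indexed so far visible to search.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, name):
        return [d for d in self._searchable if d["name"] == name]


def index_if_missing(idx, name):
    # The "search, then index on no match" pattern from the scenario.
    if not idx.search(name):
        idx.index({"name": name})


toy = ToyIndex()
index_if_missing(toy, "X")  # no match, so X is indexed
index_if_missing(toy, "X")  # still no match (no refresh yet), so X is indexed again
toy.refresh()
duplicates = toy.search("X")
print(len(duplicates))
```

After the refresh, the search returns both copies of user X.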
How to avoid this problem?
If you can generate an id for the document, you can use the doc_as_upsert parameter in the update API. The document will be created if it does not exist and updated otherwise.
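As a rough sketch of what such a request could look like, the snippet below only builds the HTTP method, path and JSON body; it does not send anything. The helper names, the `username` field, and the choice of a SHA-1 hash as the id are assumptions for illustration. The path uses the typeless `/<index>/_update/<id>` form of recent Elasticsearch versions; older versions also include the document type in the path.

```python
import hashlib
import json


def deterministic_id(username):
    # Same username -> same id, so a second indexing attempt hits the same document.
    return hashlib.sha1(username.encode("utf-8")).hexdigest()


def build_upsert_request(index, user):
    """Return (method, path, body) for an update request with doc_as_upsert."""
    doc_id = deterministic_id(user["username"])
    path = f"/{index}/_update/{doc_id}"
    # doc_as_upsert: create the document from "doc" if it does not exist yet.
    body = json.dumps({"doc": user, "doc_as_upsert": True})
    return "POST", path, body


method, path, body = build_upsert_request("users", {"username": "x"})
print(method, path)
print(body)
```

Because the id is derived from the data, two concurrent requests for the same user end up writing the same document instead of creating two.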
Otherwise you can force a refresh of the index before each search. This is not recommended because the refresh operation is expensive, but it will allow you to confirm that this was the cause of the problem.
Note that you will still need some internal synchronization mechanism, because an index operation may occur between the forced refresh and the search. See for example the following scenario:
- Thread 1 : refresh and search document X => No result
- Thread 2 : Index document X
- Thread 1 : Search and index document X due to no result
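One possible form of such a synchronization mechanism, assuming all indexing goes through a single process, is to hold a lock around the whole "check, then index" sequence. `OnceIndexer` and its two callables are hypothetical; against a real cluster, `exists` would have to do the forced refresh and the search itself.

```python
import threading


class OnceIndexer:
    """Serializes 'check, then index' so two threads cannot both see 'no result'."""

    def __init__(self, exists, index):
        self._exists = exists  # callable(key) -> bool, e.g. refresh + search
        self._index = index    # callable(key, doc), e.g. the index request
        self._lock = threading.Lock()

    def index_if_missing(self, key, doc):
        # Returns True if this call indexed the document, False if it was already there.
        with self._lock:
            if self._exists(key):
                return False
            self._index(key, doc)
            return True


# Usage with a dict standing in for the index.
store = {}
indexer = OnceIndexer(lambda k: k in store, store.__setitem__)
results = []
threads = [
    threading.Thread(
        target=lambda: results.append(indexer.index_if_missing("X", {"name": "X"}))
    )
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(results))
```

A single lock is the simplest choice but it serializes every indexing call; it also does nothing when several application instances write to the same index, in which case a generated id is the safer route.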
Remove duplicated documents
If you can accept having some duplicated documents for a short amount of time, you can use the following solution.
Each time you index a new document, keep the following information in a data structure:
- time of the indexation request
- id of the newly indexed document
- the search query used to check if the document is unique
- a flag indicating whether the entry has been searched again, initialized to false
After each refresh operation (by default every second), you will then have to run the following test:
For each entry in the data structure:
    If entry.indexationTime > now - refreshDelay AND NOT entry.flag:
        // The indexed document corresponding to the entry is searchable
        entry.flag = true   // Avoid running the search another time
        Rerun the corresponding search query, considering only the ids in the data structure to speed up the search.
        If there are multiple results, remove the duplicates.
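The test above could be sketched in Python roughly as follows. It assumes the pass runs right after each refresh, as described. The `Entry` class, the `search` and `delete` callables, and the decision to keep the document with the smallest id are all assumptions made for the example, not part of any Elasticsearch API.

```python
import time
from dataclasses import dataclass


@dataclass
class Entry:
    indexed_at: float      # time of the indexing request
    doc_id: str            # id of the newly indexed document
    query: dict            # search query used for the uniqueness check
    checked: bool = False  # the "searched again" flag


def check_entries(entries, refresh_delay, search, delete, now=None):
    """Run once after each refresh: re-check entries that just became searchable."""
    now = time.time() if now is None else now
    candidate_ids = [e.doc_id for e in entries]
    for entry in entries:
        if entry.indexed_at > now - refresh_delay and not entry.checked:
            entry.checked = True  # avoid running the search another time
            # search(query, ids) is expected to return the matching ids among `ids`.
            hits = search(entry.query, candidate_ids)
            for duplicate_id in sorted(hits)[1:]:  # keep one, remove the rest
                delete(duplicate_id)


# Usage with a dict standing in for the index.
store = {"a": "X", "b": "X", "c": "Y"}
entries = [
    Entry(9.2, "a", {"name": "X"}),
    Entry(9.6, "b", {"name": "X"}),
    Entry(5.0, "c", {"name": "Y"}, checked=True),
]
check_entries(
    entries,
    refresh_delay=1.0,
    search=lambda q, ids: [i for i in ids if store.get(i) == q["name"]],
    delete=store.pop,
    now=10.0,
)
print(store)
```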
To search only within a selected set of ids, you can use the ids query.
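For instance, the original query could be wrapped in a bool query whose filter is an ids query. The helper below only builds the request body; `restrict_to_ids` and the `username` field are made-up names for the example.

```python
import json


def restrict_to_ids(query, ids):
    """Wrap `query` so that only documents whose _id is in `ids` can match."""
    return {
        "query": {
            "bool": {
                "must": [query],
                "filter": [{"ids": {"values": list(ids)}}],
            }
        }
    }


search_body = restrict_to_ids({"term": {"username": "x"}}, ["1", "4", "100"])
print(json.dumps(search_body, indent=2))
```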
You will also have to remove from the data structure the entries that can be safely discarded, that is, the entries whose flag is set to true and for which all the other entries indexed before indexationTime + refreshDuration also have their flag set to true.
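A possible reading of this discard rule, written as a small function. Entries are plain dicts here, and the key names `indexed_at` and `checked` are assumptions for the example.

```python
def prune(entries, refresh_delay):
    """Return the entries that must be kept; the others can be safely discarded."""
    kept = []
    for entry in entries:
        # Every entry indexed before entry's time + refresh delay must be checked too.
        neighbours_checked = all(
            other["checked"]
            for other in entries
            if other["indexed_at"] < entry["indexed_at"] + refresh_delay
        )
        if not (entry["checked"] and neighbours_checked):
            kept.append(entry)
    return kept


tracked = [
    {"indexed_at": 0.0, "checked": True},
    {"indexed_at": 0.5, "checked": True},
    {"indexed_at": 1.2, "checked": True},
    {"indexed_at": 2.0, "checked": False},
]
remaining = prune(tracked, refresh_delay=1.0)
print([e["indexed_at"] for e in remaining])
```

In this example the entry indexed at 1.2 is kept even though its flag is true, because the entry indexed at 2.0 falls inside its window and has not been checked yet.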