
I am pretty new to Elasticsearch. We have set up an Elasticsearch node with 5 shards, using the default Elasticsearch configuration. All of them are primary shards; there is no replication.

We store some user-related information in Elasticsearch. In one use case, I check whether a user with a given mobile number already exists in Elasticsearch; if it does, I update the linked user, otherwise I index it as a new document.

In certain cases, I get many duplicate indexing requests for the same user, and this logic fails. It works in most scenarios, but sometimes it fails. I have not been able to get to the bottom of this problem.

From what I learned from the documentation, write consistency comes into play when replica shards are involved, but in my case there are no replica shards for now. Also, a search sends a request to all the shards, so eventually you should get the document.

I am not really able to understand why it is failing. Any help would be appreciated.

So what is the problem? Is it leading to duplicate records, and you want to avoid that? - user156327
@AmitKhandelwal - Yes, that is the main issue. - Dhruv Saksena
Please refer to stackoverflow.com/questions/56840637/… and let me know if you see any errors like this. - user156327

1 Answer


It might be due to the refresh setting. By default, a document you index is not immediately available for search; a refresh operation must occur first.

Maybe you are in the following scenario:

  1. Search user X
  2. No match => index user X
  3. Search user X
  4. Still no match because refresh has not occurred yet => Index user X again
  5. Refresh

By default the refresh operation occurs every second, so if two searches for the same user happen less than 1 second apart, you are likely to index the document twice.
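
The scenario above can be reproduced without a cluster. The following is only a toy in-memory model, not Elasticsearch code: `ToyIndex`, `index_if_absent` and the `mobile` field are hypothetical names, and the only behavior it imitates is that indexed documents become searchable after a refresh.

```python
class ToyIndex:
    """Toy model of near-real-time search: writes are buffered until refresh()."""

    def __init__(self):
        self._searchable = []  # documents visible to search()
        self._buffer = []      # documents indexed since the last refresh

    def index(self, doc):
        self._buffer.append(doc)

    def refresh(self):
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, mobile):
        return [d for d in self._searchable if d["mobile"] == mobile]


def index_if_absent(idx, doc):
    """The check-then-index logic described in the question."""
    if not idx.search(doc["mobile"]):
        idx.index(doc)


idx = ToyIndex()
user = {"mobile": "555-0100", "name": "X"}
index_if_absent(idx, user)  # steps 1-2: no match, so index
index_if_absent(idx, user)  # steps 3-4: still no match, so index again
idx.refresh()               # step 5
duplicates = idx.search("555-0100")
print(len(duplicates))      # 2
```

Both calls see an empty search result because the first write is still waiting for a refresh, so the user ends up indexed twice.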

How to avoid this problem?

If you can generate an id for the document, you can use the doc_as_upsert parameter of the Update API. The document will be created if it does not exist and updated otherwise.
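
As a rough sketch of what such a request could look like, the snippet below only builds the HTTP method, path and JSON body; it does not send anything. The index name `users`, the helper names, and the choice of a SHA-1 hash of the mobile number as the document id are my assumptions. The path uses the `/<index>/_update/<id>` form of recent Elasticsearch versions; older versions also include the mapping type in the path.

```python
import hashlib
import json


def doc_id_for(mobile):
    """Derive a stable document id from the mobile number (an assumption)."""
    return hashlib.sha1(mobile.encode("utf-8")).hexdigest()


def build_upsert_request(index, user):
    """Return (method, path, body) for an Update API call with doc_as_upsert."""
    path = f"/{index}/_update/{doc_id_for(user['mobile'])}"
    # "doc" is merged into an existing document; with doc_as_upsert it is
    # also used as the new document when none exists under this id yet.
    body = {"doc": user, "doc_as_upsert": True}
    return "POST", path, json.dumps(body)


method, path, body = build_upsert_request(
    "users", {"mobile": "555-0100", "name": "X"}
)
print(method, path)
print(body)
```

Because the id depends only on the mobile number, two concurrent requests for the same user target the same document instead of creating two.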

Otherwise, you can force a refresh of the index before each search. This is not recommended because the refresh operation is expensive, but it will let you confirm that this was the cause of the problem.

Note that you will still need some internal synchronization mechanism, because an index operation may occur between the forced refresh and the search. See, for example, the following scenario:

  1. Thread 1 : refresh and search document X => No result
  2. Thread 2 : Index document X
  3. Thread 1 : Search and index document X due to no result
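
One possible form of such synchronization, if all writers live in the same process, is a lock held across the whole refresh, search and index sequence. This is a minimal sketch on a toy in-memory index (`ToyIndex`, `write_lock` and `index_if_absent` are hypothetical names, not Elasticsearch APIs); with several application instances a process-local lock is not enough.

```python
import threading


class ToyIndex:
    """Toy stand-in for an index: writes are searchable only after refresh()."""

    def __init__(self):
        self.searchable, self.buffer = [], []

    def index(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        self.searchable += self.buffer
        self.buffer = []

    def search(self, mobile):
        return [d for d in self.searchable if d["mobile"] == mobile]


write_lock = threading.Lock()


def index_if_absent(idx, doc):
    # Hold the lock across refresh + search + index so that no other
    # thread of this process can index between the check and the write.
    with write_lock:
        idx.refresh()
        if not idx.search(doc["mobile"]):
            idx.index(doc)


idx = ToyIndex()
threads = [
    threading.Thread(target=index_if_absent, args=(idx, {"mobile": "555-0100"}))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
idx.refresh()
stored = idx.search("555-0100")
print(len(stored))  # 1
```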

Remove duplicated documents

If you can accept having some duplicate documents for a short time, you can use the following solution.

Each time you index a new document, keep the following information in a data structure:

  • time of the indexation request
  • id of the newly indexed document
  • the search query used to check if the document is unique
  • a flag indicating whether the entry has been searched again, initialized to false

After each refresh operation (by default every second), you then have to run the following check:

Foreach entry in the data structure
  If entry.indexationTime < now - refreshDelay AND NOT entry.flag
    // The entry is older than the refresh delay, so its document is searchable
    entry.flag = true
    // Avoid running the search another time
    Rerun the corresponding search query, considering only the ids in the data structure to speed up the search.
    If there are multiple results, remove the duplicates

To search only on a selected set of ids, you can use the ids query.
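
For illustration, here is one way the uniqueness query could be combined with an ids query in a bool query. The snippet only builds the request body; the `mobile` field, the example ids and the helper name `restrict_to_ids` are assumptions on my side.

```python
import json


def restrict_to_ids(query, ids):
    """Wrap `query` so that it only matches documents whose _id is in `ids`."""
    return {
        "query": {
            "bool": {
                "must": [query],
                # The ids query keeps only documents with one of these _id values.
                "filter": [{"ids": {"values": list(ids)}}],
            }
        }
    }


uniqueness_query = {"term": {"mobile": "555-0100"}}
body = restrict_to_ids(uniqueness_query, ["a1", "b2"])
print(json.dumps(body, indent=2))
```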

You will also have to remove from the data structure the entries that can be safely discarded, that is, the entries whose flag is set to true and for which all the other entries indexed before indexationTime + refreshDuration also have their flag set to true.
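
To make the bookkeeping more concrete, here is a small sketch of the data structure, the periodic check and the clean-up rule. It is not tied to any Elasticsearch client: `search` and `delete` are callables you would have to provide, all names are hypothetical, and keeping the document with the smallest id among duplicates is an arbitrary choice. The check treats an entry as searchable once it is older than the refresh delay.

```python
import time
from dataclasses import dataclass

REFRESH_DELAY = 1.0  # seconds; assumed to match the refresh interval


@dataclass
class Entry:
    indexed_at: float      # time of the indexing request
    doc_id: str            # id of the newly indexed document
    query: str             # stands in for the uniqueness search query
    checked: bool = False  # has the search been rerun for this entry?


def check_entries(entries, search, delete, now=None):
    """Rerun the search for entries that should be searchable; delete duplicates."""
    now = time.time() if now is None else now
    tracked_ids = [e.doc_id for e in entries]
    for e in entries:
        if e.indexed_at < now - REFRESH_DELAY and not e.checked:
            e.checked = True
            hits = search(e.query, tracked_ids)  # ids of matching documents
            for dup in sorted(hits)[1:]:         # keep one, delete the rest
                delete(dup)


def purge(entries):
    """Drop checked entries once every entry indexed close to them is checked."""
    def safe(e):
        return e.checked and all(
            o.checked for o in entries
            if o.indexed_at < e.indexed_at + REFRESH_DELAY
        )
    return [e for e in entries if not safe(e)]


# In-memory stand-in for the index: document id -> value matched by the query.
store = {"a": "mobile:555-0100", "b": "mobile:555-0100"}
entries = [Entry(0.0, "a", "mobile:555-0100"), Entry(0.2, "b", "mobile:555-0100")]


def search(query, ids):
    return [i for i in ids if store.get(i) == query]


check_entries(entries, search, store.pop, now=5.0)
entries = purge(entries)
print(sorted(store), len(entries))  # ['a'] 0
```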