
Let's say I have 5 data nodes and I save a Person document. I have a couple of questions:

  1. How can I find which node the document was saved to?

  2. After saving one Person document to a node with two replicas, how can I query for this Person and find out which replica/node the result came from?

  3. How can I check how quickly the document becomes available in the two replicas of a node?

EDIT

The use case is as follows: in general, how can I ensure consistency when a primary shard has had new data written to it, but that data has not yet been synchronised with a replica, and at the same time the replica is queried for the new data, which at that moment is present only on the primary shard? I am wondering about the DETAILS of consistency in the situation described in the last paragraph of the distributed read documentation. On the other hand, the documentation on the query phase says that each primary and replica is queried and builds a priority queue, and that these queues are later merged. Thus the result from the primary shard would be included in the globally sorted result set that the coordinating node builds out of all the priority queues.

  • Question X: So is a document that exists only on the primary shard returned by a search or not, if it has not yet been replicated to the replicas?

In other words, I want to ensure data consistency across my distributed ES cluster, and I want to test whether the following situation can occur. Let's say I have one cluster with 5 nodes, and the data is written to only one node (e.g. node2, which holds the primary shard). Before the data has had time to replicate to the replicas, a query for this new data arrives at node3, which in theory should hold a replica of the data but has not yet received it after node2 changed. In this case, the query sent to node3 requesting the new data would not return it, even though it has already been written to node2.

  • Question A) If this can happen, how can I control the replication phases/state so that I can tell whether replication is complete?
  • Question B) How can I tell whether the replica is consistent with the primary shard, and what state it is in (whether the replica's data is consistent or inconsistent with the primary shard)?
  • Question C) If I can't control this replication flow and data consistency, how can I eliminate potential inconsistencies for a query sent to node3?
  • Question D) How can I observe the behaviour of a doc being added to the primary shard but not yet stored on the replica shard (e.g. can I slow down / customize the replication time, or can I test this behaviour some other way)?
Just out of curiosity, may I ask why you care so much about these low-level details? What's the use case behind your needs? - Val
Well, I just want to test what will happen if two nodes are made available for querying. Then what will happen if the document from node1 is not yet on node2, but node2 is being queried for the document? - mCs

1 Answer


How can I find which node the document was saved to?

A more accurate question is which shard the document is saved to, because shards can be moved around in a cluster. You can use the _search_shards API and pass the ID of the document as the routing value:

GET /index/type/_search_shards?routing=4
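
As background on why the routing value pins down the shard: the Elasticsearch guide describes routing as shard = hash(routing) % number_of_primary_shards, with the document ID used as the routing value by default. The Python sketch below illustrates only that formula. `zlib.crc32` is a stand-in hash chosen for this example (Elasticsearch uses its own hash function, so the numbers will not match a real cluster), and `pick_shard` is a hypothetical helper name.

```python
import zlib


def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    """Illustrate shard = hash(routing) % number_of_primary_shards.

    zlib.crc32 is only a stand-in hash (an assumption for this sketch);
    real shard numbers must come from the _search_shards API.
    """
    if number_of_primary_shards < 1:
        raise ValueError("an index has at least one primary shard")
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


if __name__ == "__main__":
    # The same routing value always maps to the same shard number.
    print(pick_shard("4", 5))
```

The point of the formula is that the mapping depends only on the routing value and the number of primary shards, which is why the answer is given per shard rather than per node.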

After saving one Person document to a node with two replicas, how can I query for this Person and find out which replica/node the result came from?

I don't think you can do it easily. You could lower the slowlog thresholds and check the slowlog files for the fetch phase of that specific search request to see whether a certain node logged it. If you find the fetch in a node's slowlog, that would mean the result (if it's only one doc) came from one of that node's shards.
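
A minimal sketch of the settings body for lowering the fetch-phase slowlog threshold, assuming the per-index setting `index.search.slowlog.threshold.fetch.trace` and a body sent with PUT /<index>/_settings. The helper name `fetch_slowlog_settings` and the default of "0ms" are choices made for this example; check the slowlog documentation for your version, and restore the threshold afterwards, since logging every fetch is noisy.

```python
import json


def fetch_slowlog_settings(threshold: str = "0ms") -> dict:
    """Build a settings body that lowers the fetch-phase slowlog threshold.

    The body is meant for PUT /<index>/_settings; the helper itself sends
    nothing, it only assembles the JSON-serialisable dict.
    """
    return {"index.search.slowlog.threshold.fetch.trace": threshold}


if __name__ == "__main__":
    # Print the body so it can be pasted into a console request.
    print(json.dumps(fetch_slowlog_settings(), indent=2))
```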

How can I check how quickly the document becomes available in the two replicas of a node?

The response time of the indexing operation includes indexing on all copies of the shard (the primary and its replicas): https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-write.html#distrib-write

If this can happen, how can I control the replication phases/state so that I can tell whether replication is complete?

I think you can try using consistency: all (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-consistency), which requires all copies of the shard to be available, so the indexing operation returns only once all the other shard copies have indexed the document. But I don't think this will stop a query, made at just the right time, from reaching one of the replicas while it is still in the process of indexing the document from the primary.

How can I tell if the replica is consistent with the primary shard or not?

That's difficult. I think the only way to see whether the copies have gone out of sync is to query for the data on both copies of the shard.
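
One way to sketch that comparison in Python, assuming an Elasticsearch version that still accepts the search preference values `_primary` and `_replica` (they were removed in later releases) and returns `hits.total` as a plain number. `copies_agree` and the injected `fetch` callable are hypothetical names for this example; `fetch` stands for whatever HTTP client you use and should return the parsed JSON response for a path.

```python
from typing import Callable, Dict
from urllib.parse import urlencode


def copies_agree(fetch: Callable[[str], Dict], index: str, query: str) -> bool:
    """Run the same count-only search on primaries and on replicas.

    Returns True when both report the same hits.total. `fetch` is supplied
    by the caller and must return the parsed JSON body for a request path.
    """
    totals = []
    for preference in ("_primary", "_replica"):
        params = urlencode({"q": query, "size": 0, "preference": preference})
        response = fetch(f"/{index}/_search?{params}")
        totals.append(response["hits"]["total"])
    return totals[0] == totals[1]
```

A mismatch here does not by itself prove that replication failed: each shard copy refreshes on its own schedule, so a freshly indexed doc can briefly be searchable on one copy and not yet on the other.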

If I can't control this replication flow and data consistency, how can I eliminate potential inconsistencies?

If you notice an inconsistency, I believe the only option is to set your replica count to 0 (which deletes the replicas) and then set it back to its initial value. Basically, this recreates the replicas from the primary.
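
The two settings updates involved can be sketched as follows, assuming the `number_of_replicas` index setting updated through PUT /<index>/_settings. `replica_reset_requests` is a hypothetical helper that only describes the requests in order; sending them, and waiting for the cluster to finish allocating the new replicas, is left to the caller.

```python
from typing import Dict, List, Tuple


def replica_reset_requests(index: str, replicas: int) -> List[Tuple[str, str, Dict]]:
    """Return the two settings updates, in order, as (method, path, body).

    First drop all replicas, then restore the original count so that new
    replicas are rebuilt from the primary.
    """
    if replicas < 1:
        raise ValueError("nothing to recreate when the index has no replicas")
    path = f"/{index}/_settings"
    return [
        ("PUT", path, {"index": {"number_of_replicas": 0}}),
        ("PUT", path, {"index": {"number_of_replicas": replicas}}),
    ]
```

While the replica count is 0 the index has no redundancy, so this is better suited to a test cluster than to a busy production index.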