This could be potentially be because your data is sharded into multiple nodes (multi node cluster setup) with no replicas
, and probably one of the node are down while you are performing search queries.
For instance,
If I have a cluster of only one node, and the node has 1 index
with 4
documents
, I will get the following output when i examine indices
,
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open blog 5 1 4 0 10.9kb 10.9kb
Now, if I run match_all
query,
{
"query": {
"match_all": {}
}
}
I will get,
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1,
"hits": [........
Notice how docs.count
equals to hits
count. In above output, observe the number of shards, which are 5
. All those shards are assigned to a single node.
But if I had a multi node setup with replicas
not configured, those shards will be distributed among multiple nodes.
Assume that I have a two node cluster having Node 1 and Node 2, with a total of 5 shards, out of those 5 shards shard 0, 1 , 3 were assigned to Node 2 and that node is down for maintenance or not available for whatever reason. In this scenario, you only have shard 2
and 4
available through Node 1. Now if you attempt to retrieve or search data, what will happen? Elasticsearch will serve you search result from the surviving node i.e. Node 1.
Number of hits in this case will always be less than docs.count
value.
This kind of uncertainty can be avoided by using replicas
_cat/indices
gives you? – Val