1
votes

When examining the status of the indices in our Elasticsearch instance using curl 'http://localhost:9200/_cat/indices?v', the number of documents (docs.count) in each index is frequently larger than the number of search results returned when searching all documents in that index.

Sometimes it is an integer multiple of the search hits but not always. In one case there are 98160 hits for match_all but 805383 documents in the index.

Note that there are no nested documents in the mappings.

What is the explanation? Note that search does seem to be functioning normally.

2
Can you provide the output that _cat/indices gives you? - Val

2 Answers

0
votes

This could potentially be because your data is sharded across multiple nodes (a multi-node cluster setup) with no replicas, and one of the nodes is down while you are performing search queries.

For instance, if I have a cluster of only one node, and the node has 1 index with 4 documents, I get the following output when I examine the indices:

health status index pri rep docs.count docs.deleted store.size pri.store.size 
yellow open   blog    5   1          4            0     10.9kb         10.9kb 

Now, if I run a match_all query,

{
    "query": {
        "match_all": {}
    }
}

I will get,

{
    "took": 3,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 4,
        "max_score": 1,
        "hits": [........

Notice how docs.count equals the hit count. In the output above, observe the number of shards, which is 5. All of those shards are assigned to a single node.
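To make that comparison less manual, here is a minimal Python sketch that parses the _cat/indices?v table and compares docs.count with hits.total. The function names parse_cat_indices and count_mismatch are my own (hypothetical helpers, not part of any Elasticsearch client), and the sample data is just the output shown above with the truncated hits list left empty.

```python
def parse_cat_indices(text):
    """Parse the tabular output of _cat/indices?v into one dict per index row."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def count_mismatch(cat_row, search_response):
    """Return docs.count minus hits.total; 0 means the two numbers agree."""
    hits_total = search_response["hits"]["total"]
    # Newer Elasticsearch versions report hits.total as an object with a "value" key.
    if isinstance(hits_total, dict):
        hits_total = hits_total["value"]
    return int(cat_row["docs.count"]) - hits_total


# Sample data copied from the single-node example in this answer.
CAT_OUTPUT = """health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open   blog    5   1          4            0     10.9kb         10.9kb
"""

SEARCH_RESPONSE = {
    "took": 3,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 5, "failed": 0},
    "hits": {"total": 4, "max_score": 1, "hits": []},
}

rows = parse_cat_indices(CAT_OUTPUT)
print(rows[0]["index"], count_mismatch(rows[0], SEARCH_RESPONSE))  # prints: blog 0
```

With the numbers from the question (805383 documents, 98160 hits) the same function would return a large positive difference, which is the symptom being asked about.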

But if I had a multi-node setup with no replicas configured, those shards would be distributed among multiple nodes.

Assume a two-node cluster (Node 1 and Node 2) with a total of 5 shards, where shards 0, 1, and 3 are assigned to Node 2, and that node is down for maintenance or otherwise unavailable. In this scenario, only shards 2 and 4 are available, through Node 1. If you now attempt to retrieve or search data, Elasticsearch will serve the search results from the surviving node, i.e. Node 1, and the _shards section of the response will report fewer successful shards than the total.

The number of hits in this case will be lower than the docs.count value.
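One way to test this theory is to look at the _shards section of the search response. The sketch below is only an illustration: missing_shards is a hypothetical helper of mine, and the second response is made up to match the two-node scenario described above (it is not real output).

```python
def missing_shards(search_response):
    """Return how many shards did not contribute to this search response."""
    shards = search_response["_shards"]
    return shards["total"] - shards["successful"]


# The healthy single-node response shown earlier in this answer.
healthy = {"_shards": {"total": 5, "successful": 5, "failed": 0}}

# Hypothetical response for the scenario where only shards 2 and 4 are reachable.
degraded = {"_shards": {"total": 5, "successful": 2, "failed": 0}}

print(missing_shards(healthy))   # prints: 0
print(missing_shards(degraded))  # prints: 3
```

If this returns 0 for your searches, every shard answered, and a down node is not the reason for the difference.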

This kind of uncertainty can be avoided by using replicas.
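As a rough sketch of how replicas could be enabled, the snippet below builds an update-settings request that sets number_of_replicas through the index _settings endpoint. The host http://localhost:9200 comes from the question and the index name blog from my example; build_replica_request is a hypothetical helper, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_replica_request(base_url, index, replicas):
    """Build (but do not send) a PUT request that sets number_of_replicas on an index."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    url = base_url.rstrip("/") + "/" + index + "/_settings"
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_replica_request("http://localhost:9200", "blog", 1)
print(req.get_method(), req.full_url)

# To actually apply the setting against a running cluster, uncomment:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Keep in mind that a replica is never allocated on the same node as its primary, so on a single-node cluster the replicas stay unassigned and the index remains yellow, as in the _cat/indices output above.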

0
votes

The match_all query matches all documents, giving them all a _score of 1.0.

One thing to note is that this query will not work as expected if the email field is analyzed, which is the default for fields in Elasticsearch. In this case, the email field will be broken up into three parts: joe, blogs, and com. This means that it will match searches and documents containing any of those three terms.

how scoring works