Cluster stuck on high heap usage

Question

I have an Elasticsearch 2.2.0 cluster: 1 node, 4g heap size, 7g RAM, 2 CPU cores, 401 indices, 1,873 shards, 107,780,287 docs, 70.19GB of data in total.

I have also configured indices.fielddata.cache.size: 40%.

The problem appears when I use Kibana to query something (very simple queries). A single query works fine, but if I keep querying, Elasticsearch gets very slow and eventually gets stuck, because the JVM heap usage (as reported by Marvel) climbs to 87-95%. The same thing happens when I try to load some Kibana dashboards. The only way out of this situation is to restart the Elasticsearch service or clear all caches.

Why is the heap stuck like that?

EDIT:

_node/stats when heap is stuck

_node/stats when cluster in normal state

EDIT 2:

To better understand the problem, I went as far as analyzing a memory dump. This analysis was performed after the cluster got stuck while I was trying some Kibana queries:

Problem Suspect 1:

Problem Suspect 2:

Problem Suspect 3:

In some indices I do have a _ttl setting that is NOT working (the _ttl is set to 4 weeks, but the documents are still there...). I have changed the default mappings since then, but I have not deleted the indices with the non-working _ttl.

Could this be the main problem?

Andrei Stefan · Accepted Answer · 2016-05-03T04:26:20

I don't think you have any choice now other than to add more nodes to your cluster, increase the hardware resources of the current node, or store fewer indices in the cluster.

You have a lot of shards for such a small node and all those shards use some memory (767MB) for the usual things: terms, norms and overall memory used by segments' metadata:

    "segments": {
      "count": 14228,
      "memory_in_bytes": 804235553,
      "terms_memory_in_bytes": 747176621,
      "stored_fields_memory_in_bytes": 31606496,
      "term_vectors_memory_in_bytes": 0,
      "norms_memory_in_bytes": 694880,
      "doc_values_memory_in_bytes": 24757556,
      "index_writer_memory_in_bytes": 0,
      "index_writer_max_memory_in_bytes": 1381097464,
      "version_map_memory_in_bytes": 39362,
      "fixed_bit_set_memory_in_bytes": 0
    }
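
To see where that memory goes, here is a minimal Python sketch that breaks the numbers above down by component. The values are copied from the stats excerpt; the helper name `segment_breakdown` is only for illustration and is not part of any Elasticsearch client.

```python
# Values copied from the "segments" section of the node stats above.
segments = {
    "count": 14228,
    "memory_in_bytes": 804235553,
    "terms_memory_in_bytes": 747176621,
    "stored_fields_memory_in_bytes": 31606496,
    "term_vectors_memory_in_bytes": 0,
    "norms_memory_in_bytes": 694880,
    "doc_values_memory_in_bytes": 24757556,
}

MIB = 1024 * 1024


def segment_breakdown(stats):
    """Return {component: (size in MiB, percent of total segment memory)}."""
    total = stats["memory_in_bytes"]
    # Keep only the per-component counters, not the total or the segment count.
    parts = {
        name: value
        for name, value in stats.items()
        if name.endswith("_memory_in_bytes")
    }
    return {
        name: (value / MIB, 100.0 * value / total)
        for name, value in parts.items()
    }


breakdown = segment_breakdown(segments)
total_mib = segments["memory_in_bytes"] / MIB
avg_bytes_per_segment = segments["memory_in_bytes"] / segments["count"]

print(f"total segment memory: {total_mib:.0f} MiB")
for name, (mib, pct) in sorted(breakdown.items(), key=lambda kv: -kv[1][0]):
    print(f"{name:35s} {mib:8.1f} MiB  {pct:5.1f}%")
print(f"average per segment: {avg_bytes_per_segment / 1024:.1f} KiB")
```

In these stats the terms dictionaries account for almost all of the segment memory, and that cost exists as long as the shards are open, regardless of what is in the caches.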

You moved to ES 2.x, which means you are now using doc_values by default, and fielddata usage is, indeed, very small (about 11.7MB):

    "fielddata": {
      "memory_size_in_bytes": 12301920,
      "evictions": 0
    }

The old filter cache (now called query cache) is also very small:

    "query_cache": {
      "memory_size_in_bytes": 302888,

I am not so sure that clearing the caches (fielddata, query cache) makes a big difference. At the time the stats were gathered, the heap usage was at 2.88GB (72%), which is not that high (at 75% the JVM triggers an old GC). But still, to me that is too small a node for that many shards.
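
The arithmetic behind those percentages, as a small Python sketch. It assumes the 4g heap from the question and the 75% old GC threshold mentioned above; the function and constant names are made up for illustration.

```python
HEAP_MAX_GB = 4.0         # heap size stated in the question
HEAP_USED_GB = 2.88       # heap used when the stats were gathered
OLD_GC_THRESHOLD_PCT = 75.0  # occupancy at which an old GC is triggered


def heap_used_percent(used_gb, max_gb):
    """Heap occupancy as a percentage of the maximum heap."""
    return 100.0 * used_gb / max_gb


used_pct = heap_used_percent(HEAP_USED_GB, HEAP_MAX_GB)
# Room left before the old GC threshold is reached, in GB.
headroom_gb = HEAP_MAX_GB * OLD_GC_THRESHOLD_PCT / 100.0 - HEAP_USED_GB

print(f"heap used: {used_pct:.0f}%")
print(f"headroom before old GC: {headroom_gb:.2f} GB")
```

So even in the "normal" snapshot the node is only about 0.12GB away from the point where old GCs start, and the caches reported above are far too small to explain the gap.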

One more thing to be aware of, and unrelated to the memory issue:

    "open_file_descriptors": 29461,
    "max_file_descriptors": 65535,

With so many open file descriptors, I strongly suggest increasing the OS limit on the number of open file descriptors.
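
As a rough check, this Python sketch computes how much of the limit is already used, with the two values copied from the stats above (the function name is hypothetical):

```python
# Values copied from the process stats excerpt above.
open_file_descriptors = 29461
max_file_descriptors = 65535


def fd_usage_percent(open_fds, max_fds):
    """Share of the file descriptor limit currently in use, in percent."""
    return 100.0 * open_fds / max_fds


usage_pct = fd_usage_percent(open_file_descriptors, max_file_descriptors)
remaining = max_file_descriptors - open_file_descriptors

print(f"file descriptors in use: {usage_pct:.1f}% ({remaining} remaining)")
```

Every new index adds more shards and segments, and therefore more open files, so the remaining margin shrinks as the cluster grows.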
