Hazelcast Cluster members going out of memory due to huge number of "IsStillRunningService" objects

Question

We have a system that uses the Hazelcast IExecutorService and IMap on version 3.5. We recently had Hazelcast cluster members run out of memory in production, one after another, until all nodes had crashed with OOM.
During root-cause analysis we found thousands of log entries like the ones below, and the log files had grown very quickly. The storage volume holding the logs had also run out of space.

WARNING: [10.7.90.189]:30103 [FB] [3.5] Asking if operation execution has been started: com.hazelcast.spi.impl.operationservice.impl.IsStillRunningService$InvokeIsStillRunningOperationRunnable@48b3ac3b
Mar 30, 2016 11:09:29 AM com.hazelcast.spi.impl.operationservice.impl.Invocation
WARNING: [10.7.90.189]:30103 [FB] [3.5] While asking 'is-executing': Invocation{ serviceName='hz:core:partitionService', op=com.hazelcast.spi.impl.operationservice.impl.operations.IsStillExecutingOperation{serviceName='hz:core:partitionService', partitionId=-1, callId=59834, invocationTime=1459349279980, waitTimeout=-1, callTimeout=5000}, partitionId=-1, replicaIndex=0, tryCount=0, tryPauseMillis=0, invokeCount=1, callTimeout=5000, target=Address[1.2.3.4]:30102, backupsExpected=0, backupsCompleted=0}
com.hazelcast.core.OperationTimeoutException: No response for 10000 ms. Aborting invocation! Invocation{ serviceName='hz:core:partitionService', op=com.hazelcast.spi.impl.operationservice.impl.operations.IsStillExecutingOperation{serviceName='hz:core:partitionService', partitionId=-1, callId=268177, invocationTime=1459349295209, waitTimeout=-1, callTimeout=5000}, partitionId=-1, replicaIndex=0, tryCount=0, tryPauseMillis=0, invokeCount=1, callTimeout=5000, target=Address[10.7.90.190]:30102, backupsExpected=0, backupsCompleted=0} No response has been received!  backups-expected:0 backups-completed: 0
        at com.hazelcast.spi.impl.operationservice.impl.Invocation.newOperationTimeoutException(Invocation.java:491)
        at com.hazelcast.spi.impl.operationservice.impl.IsStillRunningService$IsOperationStillRunningCallback.setOperationTimeout(IsStillRunningService.java:224)
        at com.hazelcast.spi.impl.operationservice.impl.IsStillRunningService$IsOperationStillRunningCallback.onFailure(IsStillRunningService.java:219)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture$1.run(InvocationFuture.java:137)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
        at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76)
        at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92)

I understand that cluster members keep sending heartbeats to make sure all members are alive, and I believe the default is 10 seconds. The problem is that if any member becomes unresponsive or gets into a hung state, the remaining members keep making is-executing calls against it. The heap dump showed that more than 73% of the heap was occupied by "IsStillRunningService" objects.
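To illustrate the kind of build-up described above, here is a minimal, self-contained sketch in plain Java. It is not Hazelcast's actual code: the class `PendingCheckBacklog` and its methods are hypothetical names, and the blocked task only stands in for an is-executing call to a member that never answers. It shows that an unbounded `LinkedBlockingQueue` keeps every pending task on the heap while the worker is stuck, whereas a bounded queue with a discard policy caps the backlog.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not part of Hazelcast.
public class PendingCheckBacklog {

    // Submits `submissions` tasks to a single worker thread whose tasks never
    // complete (simulating calls to a hung member) and returns how many tasks
    // are left waiting in the queue.
    static int backlog(BlockingQueue<Runnable> queue, int submissions) {
        CountDownLatch neverAnswered = new CountDownLatch(1);
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, queue,
                new ThreadPoolExecutor.DiscardPolicy()); // drop tasks that do not fit
        try {
            for (int i = 0; i < submissions; i++) {
                executor.execute(() -> {
                    try {
                        neverAnswered.await(); // blocks until interrupted
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return queue.size();
        } finally {
            executor.shutdownNow(); // interrupts the blocked worker
        }
    }

    // Unbounded queue: every task after the first stays queued on the heap.
    static int backlogUnbounded(int submissions) {
        return backlog(new LinkedBlockingQueue<>(), submissions);
    }

    // Bounded queue: the backlog cannot exceed `capacity`.
    static int backlogBounded(int submissions, int capacity) {
        return backlog(new ArrayBlockingQueue<>(capacity), submissions);
    }

    public static void main(String[] args) {
        System.out.println("unbounded backlog: " + backlogUnbounded(1000));
        System.out.println("bounded backlog:   " + backlogBounded(1000, 100));
    }
}
```

With 1000 submissions the unbounded queue ends up holding 999 tasks (the first one is occupying the worker), while the bounded variant holds at most 100. In a real cluster the submissions never stop, so the unbounded case keeps growing until the heap is exhausted.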

Questions:

How can we find out exactly what went wrong?
Is running out of storage space just a coincidence, or could the two be correlated? We suspect that one may have led to the other, since this happened twice within a week.

Hazelcast XML Configuration:

<hazelcast xsi:schemaLocation="http://www.hazelcast.com/schema/config http://www.hazelcast.com/schema/config/hazelcast-config-3.5.xsd"
           xmlns="http://www.hazelcast.com/schema/config"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

    <map name="myMap">
        <backup-count>0</backup-count>
        <time-to-live-seconds>43200</time-to-live-seconds>
        <eviction-policy>LRU</eviction-policy>
        <max-size policy="USED_HEAP_PERCENTAGE">75</max-size>
        <eviction-percentage>10</eviction-percentage>
        <in-memory-format>OBJECT</in-memory-format>
    </map>

    <executor-service name="calculation">
        <pool-size>10</pool-size>
        <queue-capacity>400</queue-capacity>
    </executor-service>

    <executor-service name="loader">
        <pool-size>5</pool-size>
        <queue-capacity>400</queue-capacity>
    </executor-service>

    <properties>
        <property name="hazelcast.icmp.timeout">5000</property>
        <property name="hazelcast.initial.wait.seconds">10</property>
        <property name="hazelcast.connection.monitor.interval">5000</property>
    </properties>

    <network>
        <port auto-increment="true" port-count="100">30101</port>
        <join>
            <multicast enabled="false">
                <multicast-group>224.2.2.3</multicast-group>
                <multicast-port>54327</multicast-port>
            </multicast>
            <tcp-ip enabled="true">
                <interface>1.2.3.4</interface>
                <interface>1.2.3.5</interface>
                <interface>1.2.3.6</interface>
            </tcp-ip>
            <aws enabled="false"/>
        </join>
        <interfaces enabled="false">
            <interface>127.0.0.1</interface>
        </interfaces>
    </network>
</hazelcast>

[Image: stack trace]
[Image: LinkedBlockingQueue which holds the IsStillRunningService objects]

pveentjer · Accepted Answer · 2016-04-04T03:58:36

Can you upgrade to 3.6? Fixes were added there to prevent running into an OOME through is-still-running. In 3.7 the whole mechanism is going to be removed and replaced by a less problematic approach.

https://github.com/hazelcast/hazelcast/pull/7719
