Hazelcast : Tuning properties for a node having temporary network glitch in a cluster

Question

We have an embedded Hazelcast cluster running on 10 AWS instances. The Hazelcast version is 3.7.3. Right now we have the following settings for Hazelcast:

hazelcast.max.no.heartbeat.seconds=30
hazelcast.max.no.master.confirmation.seconds=150
hazelcast.heartbeat.interval.seconds=1
hazelcast.operation.call.timeout.millis=5000
hazelcast.merge.first.run.delay.seconds=60

Apart from the settings above, all other properties are at their default values.

Recently one of the nodes was unreachable for a few minutes, and some operations that read from the cache slowed down. We have a backup for each map, so if data was not available from one partition, Hazelcast should have responded from another partition, but it seems everything slowed down because one node was unreachable.

The following is the exception that we saw in the Hazelcast logs.

[3.7.2] PartitionIteratingOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2017-05-30 16:12:52.442. Total elapsed time: 10825 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2017-05-30 16:12:42.166. Invocation{op=com.hazelcast.spi.impl.operationservice.impl.operations.PartitionIteratingOperation{serviceName='hz:impl:mapService', identityHash=1798676695, partitionId=-1, replicaIndex=0, callId=0, invocationTime=1496160761670 (2017-05-30 16:12:41.670), waitTimeout=-1, callTimeout=5000, operationFactory=com.hazelcast.map.impl.operation.MapGetAllOperationFactory@2afbcab7}, tryCount=10, tryPauseMillis=300, invokeCount=1, callTimeoutMillis=5000, firstInvocationTimeMs=1496160761617, firstInvocationTime='2017-05-30 16:12:41.617', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[172.18.84.36]:9123, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=12, /172.18.64.219:9123->/172.18.84.36:48180, endpoint=[172.18.84.36]:9123, alive=true, type=MEMBER]}

Can someone suggest the correct Hazelcast settings so that one temporarily unreachable node doesn't slow down the whole cluster?

pveentjer · Accepted Answer · 2017-06-02T05:25:20

The operation call timeout (hazelcast.operation.call.timeout.millis, set to 5000 in the question) should not be set to a low value. It is probably best to leave it at the default value. Some internal mechanisms, such as heartbeats, rely on the call timeout.
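
As a rough illustration of that advice, here is a minimal sketch that keeps the other settings from the question but no longer overrides the call timeout. It assumes the embedded member reads hazelcast.* values from JVM system properties, that apply() runs before the member is started, and that nothing else (for example a hazelcast.xml file) sets the call timeout. The class and method names are made up for this example and are not part of Hazelcast.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not a Hazelcast class.
public class HazelcastTuning {
    // The property the answer recommends leaving at its default.
    static final String CALL_TIMEOUT = "hazelcast.operation.call.timeout.millis";

    // Settings from the question, minus the call-timeout override.
    static Map<String, String> tunedProperties() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("hazelcast.max.no.heartbeat.seconds", "30");
        props.put("hazelcast.max.no.master.confirmation.seconds", "150");
        props.put("hazelcast.heartbeat.interval.seconds", "1");
        props.put("hazelcast.merge.first.run.delay.seconds", "60");
        return props;
    }

    // Assumption: called before the embedded Hazelcast member starts.
    static void apply(Map<String, String> props) {
        // Drop any earlier system-property override so the default applies.
        System.clearProperty(CALL_TIMEOUT);
        props.forEach(System::setProperty);
    }

    public static void main(String[] args) {
        apply(tunedProperties());
        // Print the effective system properties for a quick check.
        tunedProperties().keySet()
                .forEach(k -> System.out.println(k + "=" + System.getProperty(k)));
    }
}
```

The same effect can be had by simply deleting the hazelcast.operation.call.timeout.millis line from wherever the properties are currently configured; the sketch only makes the omission explicit.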
