0
votes

We have an embedded Hazelcast cluster with 10 AWS instances. The Hazelcast version is 3.7.3. Right now we have the following settings for Hazelcast:

hazelcast.max.no.heartbeat.seconds=30
hazelcast.max.no.master.confirmation.seconds=150                
hazelcast.heartbeat.interval.seconds=1
hazelcast.operation.call.timeout.millis=5000
hazelcast.merge.first.run.delay.seconds=60

Apart from the above settings, all other property values are at their defaults.
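For an embedded deployment, one way to apply overrides like these is through JVM system properties, which Hazelcast consults at startup. The following is only a minimal sketch, assuming the properties are set before the first Hazelcast instance is created; the class name `HazelcastSystemProperties` and its helper methods are hypothetical, and the values are copied from the list above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hazelcast: collects the overrides from the
// question and publishes them as JVM system properties.
final class HazelcastSystemProperties {

    // The five overrides listed in the question, in the same order.
    static Map<String, String> settings() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("hazelcast.max.no.heartbeat.seconds", "30");
        m.put("hazelcast.max.no.master.confirmation.seconds", "150");
        m.put("hazelcast.heartbeat.interval.seconds", "1");
        m.put("hazelcast.operation.call.timeout.millis", "5000");
        m.put("hazelcast.merge.first.run.delay.seconds", "60");
        return m;
    }

    // Assumption: this runs before any Hazelcast instance is started,
    // so the values are visible when the member reads its properties.
    static void apply() {
        settings().forEach(System::setProperty);
    }

    public static void main(String[] args) {
        apply();
        // Print what the JVM now reports for each key.
        settings().keySet().forEach(
                k -> System.out.println(k + "=" + System.getProperty(k)));
    }
}
```

Keeping the overrides in one place makes it easier to see which values differ from the defaults when diagnosing a slowdown.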

Recently one of the nodes was unreachable for a few minutes, and some operations slowed down while reading from the cache. We have a backup for each map, so if entries were not available from one partition, Hazelcast should have responded from another partition, but it seems everything slowed down because one node was unreachable.

The following is the exception that we saw in the Hazelcast logs:

[3.7.2] PartitionIteratingOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2017-05-30 16:12:52.442. Total elapsed time: 10825 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2017-05-30 16:12:42.166. Invocation{op=com.hazelcast.spi.impl.operationservice.impl.operations.PartitionIteratingOperation{serviceName='hz:impl:mapService', identityHash=1798676695, partitionId=-1, replicaIndex=0, callId=0, invocationTime=1496160761670 (2017-05-30 16:12:41.670), waitTimeout=-1, callTimeout=5000, operationFactory=com.hazelcast.map.impl.operation.MapGetAllOperationFactory@2afbcab7}, tryCount=10, tryPauseMillis=300, invokeCount=1, callTimeoutMillis=5000, firstInvocationTimeMs=1496160761617, firstInvocationTime='2017-05-30 16:12:41.617', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[172.18.84.36]:9123, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=12, /172.18.64.219:9123->/172.18.84.36:48180, endpoint=[172.18.84.36]:9123, alive=true, type=MEMBER]}

Can someone suggest the correct Hazelcast settings so that one temporarily unreachable node doesn't slow down the whole cluster?

2 Answers

0
votes

The operation call timeout should not be set to a low value. It is probably best to leave it at the default. Some internal mechanisms, such as heartbeats, rely on the call timeout.
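As a rough illustration of this advice, the sketch below drops the call timeout override from a set of properties so the default applies, and compares the 10825 ms elapsed time reported in the question's log against both timeouts. The class and method names are hypothetical, and the 60000 ms default is an assumption for the 3.x line that should be checked against the manual for the exact version in use. The comparison is plain arithmetic and does not by itself show that the slowdown would disappear.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hazelcast.
final class CallTimeoutCheck {

    static final String CALL_TIMEOUT_KEY = "hazelcast.operation.call.timeout.millis";

    // Assumed default for the 3.x line; verify against your version's manual.
    static final long ASSUMED_DEFAULT_CALL_TIMEOUT_MILLIS = 60_000L;

    // Returns a copy of the overrides without the call timeout entry,
    // so that Hazelcast falls back to its own default for that property.
    static Map<String, String> withoutCallTimeout(Map<String, String> overrides) {
        Map<String, String> copy = new LinkedHashMap<>(overrides);
        copy.remove(CALL_TIMEOUT_KEY);
        return copy;
    }

    // True if an invocation that ran for elapsedMillis outlived the call timeout.
    static boolean exceeded(long elapsedMillis, long callTimeoutMillis) {
        return elapsedMillis > callTimeoutMillis;
    }

    public static void main(String[] args) {
        long elapsed = 10_825L; // "Total elapsed time" from the log in the question

        // Against the configured 5000 ms the invocation is well past its timeout.
        System.out.println(exceeded(elapsed, 5_000L));

        // Against the assumed default the same elapsed time is within the timeout.
        System.out.println(exceeded(elapsed, ASSUMED_DEFAULT_CALL_TIMEOUT_MILLIS));
    }
}
```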

0
votes

According to the reference manual for version 3.11.7:

[image from the reference manual, not preserved]

I recommend reading about split-brain syndrome.

[image, not preserved]

Maybe you should create another quorum to fall back on in case your node fails to communicate.

Also, from experience, I recommend getting the reference manual specific to your version. Even if the default is supposed to be 5, I found that the manual for a specific version may recommend other values.