We have an embedded Hazelcast cluster running on 10 AWS instances. The Hazelcast version is 3.7.3. Right now we have the following Hazelcast settings:
hazelcast.max.no.heartbeat.seconds=30
hazelcast.max.no.master.confirmation.seconds=150
hazelcast.heartbeat.interval.seconds=1
hazelcast.operation.call.timeout.millis=5000
hazelcast.merge.first.run.delay.seconds=60
Apart from the settings above, all other properties are left at their default values.
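For illustration, here is a minimal sketch of one way such properties can be supplied to an embedded member. It assumes they are passed as JVM system properties set before the member starts (Hazelcast also accepts them in its XML or programmatic configuration). The class name `HazelcastProps` and its methods are hypothetical, and this is not necessarily how the cluster described here sets them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HazelcastProps {
    // The five non-default properties listed above, as name -> value.
    public static Map<String, String> settings() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("hazelcast.max.no.heartbeat.seconds", "30");
        p.put("hazelcast.max.no.master.confirmation.seconds", "150");
        p.put("hazelcast.heartbeat.interval.seconds", "1");
        p.put("hazelcast.operation.call.timeout.millis", "5000");
        p.put("hazelcast.merge.first.run.delay.seconds", "60");
        return p;
    }

    // Copies the settings into JVM system properties.
    // Assumption: this runs before the embedded Hazelcast member is created.
    public static void apply() {
        settings().forEach(System::setProperty);
    }

    public static void main(String[] args) {
        apply();
        // Print what the JVM now holds for each property.
        settings().keySet().forEach(
            k -> System.out.println(k + "=" + System.getProperty(k)));
    }
}
```

The same values could equally be given on the command line, for example `-Dhazelcast.max.no.heartbeat.seconds=30`.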
Recently one of the nodes was unreachable for a few minutes, and some operations that read from the cache slowed down. We have a backup for each map, so if data was not available from one partition, Hazelcast should have responded from another partition, but it seems everything slowed down because one node was unreachable.
The following is the exception we saw in the Hazelcast logs:
[3.7.2] PartitionIteratingOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2017-05-30 16:12:52.442. Total elapsed time: 10825 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2017-05-30 16:12:42.166. Invocation{op=com.hazelcast.spi.impl.operationservice.impl.operations.PartitionIteratingOperation{serviceName='hz:impl:mapService', identityHash=1798676695, partitionId=-1, replicaIndex=0, callId=0, invocationTime=1496160761670 (2017-05-30 16:12:41.670), waitTimeout=-1, callTimeout=5000, operationFactory=com.hazelcast.map.impl.operation.MapGetAllOperationFactory@2afbcab7}, tryCount=10, tryPauseMillis=300, invokeCount=1, callTimeoutMillis=5000, firstInvocationTimeMs=1496160761617, firstInvocationTime='2017-05-30 16:12:41.617', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[172.18.84.36]:9123, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=12, /172.18.64.219:9123->/172.18.84.36:48180, endpoint=[172.18.84.36]:9123, alive=true, type=MEMBER]}
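To make the timings in this log entry easier to read, the sketch below recomputes the intervals from the timestamps it contains. It uses only values copied from the log; the class name `LogTimings` and the helper `millisBetween` are hypothetical, and the code does not model how Hazelcast itself decides that an invocation has timed out.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class LogTimings {
    // Timestamp layout used in the log entry above.
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss.SSS");

    // Milliseconds elapsed from one log timestamp to another.
    public static long millisBetween(String from, String to) {
        return Duration.between(
            LocalDateTime.parse(from, FMT),
            LocalDateTime.parse(to, FMT)).toMillis();
    }

    public static void main(String[] args) {
        String currentTime = "2017-05-30 16:12:52.442";
        // firstInvocationTime -> Current time
        System.out.println(millisBetween("2017-05-30 16:12:41.617", currentTime)); // 10825
        // Last operation heartbeat from member -> Current time
        System.out.println(millisBetween("2017-05-30 16:12:42.166", currentTime)); // 10276
    }
}
```

The first value matches the reported "Total elapsed time: 10825 ms". Both intervals are roughly twice the `callTimeout=5000` shown in the invocation.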
Can someone suggest the correct Hazelcast settings so that one temporarily unreachable node doesn't slow down the whole cluster?

