Changes to ignite cluster membership unexplainable

Question

I am running a 12-node Ignite cluster. Each JVM runs on its own VMware node. I am using ZooKeeper to keep these Ignite nodes in sync via TCP discovery. I have been seeing a lot of node failures in the ZooKeeper logs even though the Java processes are still running, and I don't know why some Ignite nodes leave the cluster with "node failed" kinds of errors. VMware uses vMotion to do something they call "migration". I am assuming that is some kind of filesystem sync process between VMware nodes. I am also seeing fairly frequent "dumping pending object" and "Failed to wait for partition map exchange" messages in the Ignite JVM logs. My environment is as follows:

Apache Ignite 1.9.0
RHEL 7.2 (Maipo) runs on each of the 12 nodes
Oracle JDK 1.8
Zookeeper 3.4.9

Please let me know your thoughts.

TIA

Could ntp settings cause any weird behavior? Some of the 12 nodes have ntp turned on and some don't. Some are ntp synchronized and others are not. — ZeroGraviti

Valentin Kulichenko · Accepted Answer · 2017-05-22T09:23:37

There are generally two possible reasons:

Memory issues. For example, if a node goes into a long GC pause, it can become unresponsive and is therefore removed from the topology. For more details, read here: https://apacheignite.readme.io/docs/jvm-and-system-tuning
Network connectivity issues. Check whether the network between your VMs is stable. You may also want to try increasing the failure detection timeout: https://apacheignite.readme.io/docs/cluster-config#failure-detection-timeout
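
To tell the two causes apart, it can help to measure how long a JVM actually stalls. Below is a minimal sketch of a pause detector that uses only the JDK: it sleeps for a short interval and reports how much longer than expected each wake-up took. The class name PauseDetector, the 100 ms sampling interval and the 20-sample default are assumptions for illustration, not part of Ignite. A long GC pause shows up as a large overshoot; a VM-level stall may show up as well, depending on how the guest clock behaves.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not an Ignite class: estimates JVM stalls from sleep overshoot.
class PauseDetector {

    /** Overshoot of one sample in milliseconds (never negative). */
    public static long pauseMillis(long elapsedNanos, long intervalNanos) {
        return Math.max(0L, TimeUnit.NANOSECONDS.toMillis(elapsedNanos - intervalNanos));
    }

    /** Takes the given number of samples and returns the largest overshoot seen. */
    public static long maxPauseMillis(int samples, long intervalMs) throws InterruptedException {
        long intervalNanos = TimeUnit.MILLISECONDS.toNanos(intervalMs);
        long max = 0L;
        long last = System.nanoTime();
        for (int i = 0; i < samples; i++) {
            Thread.sleep(intervalMs);
            long now = System.nanoTime();
            // Anything beyond the requested sleep is time the thread could not run.
            max = Math.max(max, pauseMillis(now - last, intervalNanos));
            last = now;
        }
        return max;
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed defaults: 20 samples, 100 ms apart (about 2 seconds in total).
        int samples = args.length > 0 ? Integer.parseInt(args[0]) : 20;
        long worst = maxPauseMillis(samples, 100L);
        System.out.println("Longest observed pause: " + worst + " ms");
    }
}
```

If the longest pause observed on a node approaches the configured failure detection timeout, the memory explanation becomes more likely and GC tuning is the place to start. If the pauses stay small while nodes still drop out, the network between the VMs is the more likely suspect. In a real deployment this loop would run continuously on a daemon thread inside each node's JVM rather than for a fixed number of samples.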
