We are running a 3-node cluster with Hadoop 2.6, and the YARN ResourceManager (RM) is configured for HA. When we network-partition the node where the active RM runs, we observe that the standby RM becomes active; however, all the Samza containers die on all the nodes and are re-created. We made sure not to network-partition the node running the job coordinator.

Assume N1 (Node 1) is running the standby RM, N2 is running the active RM, and N3 is running the job coordinator.
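For context, which RM is active or standby can be verified with the yarn rmadmin CLI before and after the partition. A minimal sketch, assuming the RM ids are rm1 (on N1) and rm2 (on N2); substitute the ids from yarn.resourcemanager.ha.rm-ids in your yarn-site.xml:

# Assumed RM ids; take the real ones from yarn.resourcemanager.ha.rm-ids
yarn rmadmin -getServiceState rm1   # expected before the partition: standby (N1)
yarn rmadmin -getServiceState rm2   # expected before the partition: active (N2)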

We run the following command on N1 and N3 to create the network partition:

sudo route add -host <N2_IP> reject

and this on N2:

sudo route add -host <N1_IP> reject && sudo route add -host <N3_IP> reject

These commands are run on all three nodes simultaneously.
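For completeness, the partition can be undone after the test by deleting the reject route on the node where each was added (the <N*_IP> placeholders are the same as above):

# On N1 and N3: restore connectivity to N2
sudo route del -host <N2_IP> reject
# On N2: restore connectivity to N1 and N3
sudo route del -host <N1_IP> reject
sudo route del -host <N3_IP> reject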

My question is: why were the old containers killed and re-created?

NodeManager logs:

2020-10-07 10:38:36,985 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_e02_1602050235280_0001_01_000004 transitioned from RUNNING to KILLING

ResourceManager logs:

2020-10-07 10:38:35,971 WARN org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Cannot get RMApp by appId=application_1601919334597_0001, just added it to finishedApplications list for cleanup

2020-10-07 10:38:38,290 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Container container_e02_1602050235280_0001_01_000004 completed with event FINISHED, but corresponding RMContainer doesn't exist.