
I want to ask a few questions to understand how YARN works:

  1. Can anyone explain, or point me to a document that clearly explains, the failure modes in YARN (i.e. task failure, ApplicationMaster failure, NodeManager failure, ResourceManager failure)?
  2. What is the container size in YARN? Is it the same as a slot in MapReduce 1?
  3. Is there any practical/working example of YARN? Thank you.

1 Answer


Refer to the Hadoop: The Definitive Guide textbook ... Apart from that, there is a lot of information on the Apache website.

Container size is not fixed; containers are allocated dynamically by the ResourceManager based on the resources each application requests, so there is no fixed "slot" as in MapReduce 1.
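To make the contrast with fixed MRv1 slots concrete, here is a minimal sketch of an ApplicationMaster asking the ResourceManager for a container of a particular size, using the standard AMRMClient API; the memory/vcore values and the empty host/tracking-URL arguments are illustrative placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerRequestSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();

    // Client used by an ApplicationMaster to talk to the ResourceManager.
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();

    // Register this ApplicationMaster with the ResourceManager
    // (host, port, and tracking URL are placeholders here).
    rmClient.registerApplicationMaster("", 0, "");

    // Ask for a container with 2048 MB of memory and 2 virtual cores.
    // Unlike an MRv1 slot, this size is chosen per request, bounded by
    // yarn.scheduler.minimum/maximum-allocation-mb and -vcores.
    Resource capability = Resource.newInstance(2048, 2);
    Priority priority = Priority.newInstance(0);
    rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));

    // Subsequent allocate() heartbeats would return the granted containers.
  }
}
```

In a real application this code runs inside the AM container that the ResourceManager launches for the job, not as a standalone program.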

From a developer's perspective, the same old MapReduce code will run on YARN without changes.

ResourceManager failures

In the initial versions of the YARN framework, a ResourceManager failure meant a total cluster failure, because the ResourceManager was a single point of failure. The ResourceManager stores the state of the cluster, such as the metadata of submitted applications, information on cluster resource containers, the cluster's general configuration, and so on. Therefore, if the ResourceManager went down because of a hardware failure, there was no option but to manually recover the cluster and restart the ResourceManager. While the ResourceManager was down, the cluster was unavailable, and once it was restarted all jobs had to be restarted, so half-completed jobs lost their progress and had to be rerun from scratch. In short, a restart of the ResourceManager used to restart all running ApplicationMasters.

The latest versions of YARN address this problem in two ways. One is an active-passive ResourceManager architecture, so that when the active ResourceManager goes down, another one becomes active and takes over responsibility for the cluster. The other is to store the ResourceManager state externally in a ZooKeeper quorum, so that one ResourceManager is active while one or more ResourceManagers sit in passive mode, waiting to be brought to the active state when the current one fails.
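As a rough illustration of the HA setup described above, here is a sketch of the relevant properties (normally set in yarn-site.xml rather than in code); the hostnames, cluster id, and ZooKeeper address are placeholders:

```java
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RmHaConfigSketch {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();

    // Enable the active/standby ResourceManager pair.
    conf.setBoolean("yarn.resourcemanager.ha.enabled", true);
    conf.set("yarn.resourcemanager.cluster-id", "example-cluster");    // placeholder
    conf.set("yarn.resourcemanager.ha.rm-ids", "rm1,rm2");
    conf.set("yarn.resourcemanager.hostname.rm1", "rm1.example.com");  // placeholder hosts
    conf.set("yarn.resourcemanager.hostname.rm2", "rm2.example.com");

    // Persist RM state in ZooKeeper so a failed-over or restarted RM can
    // recover running applications instead of restarting every ApplicationMaster.
    conf.setBoolean("yarn.resourcemanager.recovery.enabled", true);
    conf.set("yarn.resourcemanager.store.class",
        "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore");
    conf.set("yarn.resourcemanager.zk-address", "zk1.example.com:2181"); // placeholder quorum
  }
}
```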

ApplicationMaster failures

When the ApplicationMaster fails, the ResourceManager simply starts another container with a new ApplicationMaster running in it, as a new application attempt. It is the responsibility of the new ApplicationMaster to recover the state of the old one, and this is only possible if the ApplicationMaster persists its state to an external location so the next attempt can use it. An ApplicationMaster that stores its state on persistent storage can therefore recover all of the status accumulated up to the point of failure.
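The number of ApplicationMaster attempts the ResourceManager will launch before failing the application is configurable; a minimal sketch of the relevant knobs (the values shown are the usual defaults):

```java
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmRetrySketch {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();

    // Cluster-wide cap on ApplicationMaster attempts: after this many AM
    // failures the whole application is marked as failed.
    conf.setInt("yarn.resourcemanager.am.max-attempts", 2);

    // MapReduce jobs can set a per-job limit, bounded by the value above.
    conf.setInt("mapreduce.am.max-attempts", 2);
  }
}
```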

NodeManager failures

If a NodeManager fails, the ResourceManager detects the failure through a time-out (that is, it stops receiving heartbeats from the NodeManager). The ResourceManager then removes the NodeManager from its pool of available NodeManagers, kills all the containers running on that node, and reports the failure to all running ApplicationMasters. The ApplicationMasters are then responsible for reacting to the node failure by redoing the work of any containers that were running on that node at the time of the fault.
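The heartbeat time-out that drives this detection is configurable; a small sketch using the standard liveness-monitor property (the value shown is the usual 10-minute default):

```java
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NmLivenessSketch {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();

    // How long the ResourceManager waits without receiving heartbeats before
    // it declares a NodeManager dead (600000 ms = 10 minutes).
    conf.setLong("yarn.nm.liveness-monitor.expiry-interval-ms", 600000L);
  }
}
```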

Container Failures

Container failures are reported by the NodeManager to the ResourceManager, and the ResourceManager informs the corresponding ApplicationMaster. The ApplicationMaster then restarts the failed work in a new container.
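For a MapReduce job running on YARN, this retry behaviour is bounded by the per-task attempt limits; a minimal sketch (the values shown are the usual defaults):

```java
import org.apache.hadoop.conf.Configuration;

public class TaskRetrySketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // A failed container (task attempt) is retried in a new container up to
    // these limits; exceeding them fails the task and, by default, the job.
    conf.setInt("mapreduce.map.maxattempts", 4);
    conf.setInt("mapreduce.reduce.maxattempts", 4);
  }
}
```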