7
votes

I'm trying to understand if the Spark Driver is a single point of failure when deploying in cluster mode for Yarn. So I'd like to get a better grasp of the innards of the failover process regarding the YARN Container of the Spark Driver in this context.

I know that the Spark Driver will run in the Spark Application Master inside a Yarn Container. The Spark Application Master will request resources to the YARN Resource Manager if required. But I haven't been able to find a document with enough detail about the failover process in the event of the YARN Container of the Spark Application Master (and Spark driver) failing.

I'm trying to find out some detailed resources that can allow me to answer some questions related to the following scenario: If the host machine of the YARN Container that runs the Spark Application Master / Spark Driver losses network connectivity for 1 hour:

  1. Does the YARN Resource Manager spawn a new YARN Container with another Spark Application Master/Spark Driver?

  2. In that case (spawning a new YARN Container), does it start the Spark Driver from scratch if at least 1 stage in 1 of the Executors had been completed and notified as such to the original Driver before it failed? Does the option used in persist() make a difference here? And will the new Spark Driver know that the executor had completed 1 stage? Would Tachyon help out in this scenario?

  3. Does a failback process get triggered if network connectivity is recovered in the YARN Container's host machine of the original Spark Application Master? I guess that this behaviour can be controlled from YARN, but I don't know what's the default when deploying SPARK in cluster mode.

I'd really appreciate it if you can point me out to some documents / web pages where the Architecture of Spark in yarn-cluster mode and the failover process are explored in detail.

1

1 Answers

5
votes

We just started running on YARN, so I don't know much. But I'm almost certain we had no automatic failover at the driver's level. (We implemented some on our own.)

I would not expect there to be any default failover solution for the driver. You (the driver author) are the only one who knows how to health-check your application. And the state that lives in the driver is not something that can be automatically serialized. When a SparkContext is destroyed, all the RDDs created in it are lost, because they are meaningless without the running application.

What you can do

The recovery strategy we have implemented is very simple. After every costly Spark operation we make a manual checkpoint. We save the RDD to disk (think saveAsTextFile) and load it back right away. This erases the lineage of the RDD, so it will be reloaded rather than recalculated if a partition is lost.

We also store what we have done and the file name. So if the driver restarts, it can pick up where it left off, at the granularity of such operations.