0
votes

We're using AWS EMR for our spark jobs. All our jobs are submitted in yarn cluster mode, so the driver will run in one of the cluster nodes. We use on-demand node for master, and spot-instances for the core nodes. Now, although we almost always choose instances with < 5% interruption rate, sometimes it so happens that a significant fraction of our cluster nodes get terminated prematurely (probably because of higher demands).

So, I was wondering, in the above situation, what happens if a node containing the driver process goes down? Is there any chance of recovery for the spark job in that case? Or is the job gone forever?

2

2 Answers

1
votes

The Spark driver is a single point of failure because it holds all cluster state for the running App.

In practice non-ephemeral storage can be used for check-pointing batch Apps after expensive expensive transformations. That said, trying to re-start after such a situation can be done, but when I looked into it, it is quite difficult to say the least. I asked such a question under my name some time ago, you can find it. I am quite technical but felt: gosh what a lot of hard work.

So, the recovery means rolling your own stuff, or accepting a re-run. Since I last evaluated EMR I see that the driver can run on the Master and that can be failed-over, but that is not the same thing as far as I can see, nor what you wish.

0
votes

EMR has node leveling for CORE nodes in Yarn. Your spark driver/ Application master only gets created in CORE nodes. And HDFS also resides in CORE nodes only. So to handle your situation in a best way, you may consider to use both CORE and TASK group. What you can do to tackle this -

  1. MASTER: On-demand
  2. CORE: On-demand. Minimum no of Instances can be 1.
  3. TASK: Spot with autoscaling with minimal EBS volume. Minimum no of Instances can be 0 this case.

This will reduce your cost also ensure that node containing the driver process never goes down.

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html