Spark Streaming and High Availability

Question

I'm building an Apache Spark application that acts on multiple streams.

I did read the Performance Tuning section of the documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning

What I didn't get is:

1) Are the streaming receivers located on multiple worker nodes or on the driver machine?

2) What happens if one of the nodes that receives the data fails (power off/restart)?

Yuval Itzchakov · Accepted Answer · 2016-09-15T14:10:36

Are the streaming receivers located on multiple worker nodes or on the driver machine?

Receivers run on the worker nodes, where they are responsible for consuming data from the source that holds it.

What happens if one of the nodes that receives the data fails (power off/restart)?

The receiver runs on a worker node, and the worker node gets its tasks from the driver. The driver runs either on the machine that submits the application (for example, a dedicated master server) if you're running in Client Mode, or on one of the workers if you're running in Cluster Mode. If a node that isn't running the driver fails, the driver reassigns the partitions held on the failed node to a different node, which can then re-read the data from the source and do the additional processing needed to recover from the failure.

This is why a replayable source such as Kafka or AWS Kinesis is needed.
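
To make the recovery path concrete, here is a toy model in plain Python. It is not Spark code and does not use any Spark API: the names ReplayableSource, Driver, run_batch, reassign and demo are all made up for illustration, and the "commit after processing" bookkeeping is a simplifying assumption. It only shows the idea that, because the source keeps its records, a different node can pick up from the last committed offset after a failure.

```python
from dataclasses import dataclass, field


@dataclass
class ReplayableSource:
    """Hypothetical stand-in for a replayable log (e.g. one Kafka partition)."""
    records: list

    def read(self, offset, count):
        # Reading does not remove records, so any node can re-read them later.
        return self.records[offset:offset + count]


@dataclass
class Driver:
    """Hypothetical driver: tracks partition owners and committed offsets."""
    sources: dict                                   # partition id -> ReplayableSource
    owners: dict                                    # partition id -> node name
    committed: dict = field(default_factory=dict)   # partition id -> next offset to read
    output: list = field(default_factory=list)

    def run_batch(self, partition, batch_size, fail=False):
        offset = self.committed.get(partition, 0)
        batch = self.sources[partition].read(offset, batch_size)
        if fail:
            # Simulated node failure mid-batch: nothing emitted, nothing committed.
            raise RuntimeError(f"node {self.owners[partition]} failed")
        self.output.extend(record.upper() for record in batch)
        self.committed[partition] = offset + len(batch)

    def reassign(self, failed_node, healthy_node):
        # Move every partition owned by the failed node to a healthy one.
        for partition, node in self.owners.items():
            if node == failed_node:
                self.owners[partition] = healthy_node


def demo():
    driver = Driver(
        sources={0: ReplayableSource(["a", "b", "c", "d"])},
        owners={0: "worker-1"},
    )
    driver.run_batch(0, batch_size=2)                  # worker-1 processes a, b
    try:
        driver.run_batch(0, batch_size=2, fail=True)   # worker-1 dies while on c, d
    except RuntimeError:
        driver.reassign("worker-1", "worker-2")
    driver.run_batch(0, batch_size=2)                  # worker-2 re-reads c, d
    return driver
```

Calling demo() leaves every record processed exactly once in this toy setup, with partition 0 now owned by worker-2. If the source had discarded c and d as soon as worker-1 read them, there would be nothing for worker-2 to re-read, which is the point about needing a replayable source.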
