Recently I've been running performance tests on Spark Streaming, and a few things still puzzle me.
In Spark Streaming, receivers are scheduled to run in executors on worker nodes.
- How many receivers are there in a cluster? Can I control the number of receivers, e.g. through the way I set up the input streams (as in the sketch after this list)?
- If not all workers run a receiver, will the other worker nodes receive any data at all? If they do, how can task scheduling still honor data locality? Is the data copied over from the nodes that run the receivers?
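To make the first question concrete, here is a minimal sketch of what I mean by "controlling the number of receivers". It assumes a receiver-based source (socketTextStream); the host name and ports are placeholders, not my real setup. As far as I understand, each receiver-based input DStream starts exactly one receiver on one executor, so the receiver count should equal the number of input DStreams created:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MultiReceiverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("multi-receiver-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Each receiver-based input DStream runs one receiver on one executor,
    // so creating N input DStreams should give N receivers in the cluster.
    // "stream-host" and the ports are placeholders.
    val numReceivers = 3
    val streams = (1 to numReceivers).map { i =>
      ssc.socketTextStream("stream-host", 9000 + i)
    }

    // Union the per-receiver streams into a single DStream for processing.
    val unified = ssc.union(streams)
    unified.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Is this the intended way to scale out receiving, and does it interact with how tasks are later scheduled for locality?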