In Spark Streaming, stream data is received by receivers that run on workers. The data is pushed into a data block periodically, and the receiver sends the receivedBlockInfo to the driver. I want to know whether Spark Streaming distributes the blocks across the cluster (in other words, whether it uses a distributed storage strategy). If it does not distribute the data across the cluster, how is workload balance guaranteed? (Imagine we have a cluster of tens of nodes but only a few receivers.)
A DStream divides the data into one RDD per batch. An RDD is, by definition, distributed. So yes, the data in each batch is distributed.
– ayan guha
Will it duplicate each batch RDD in a DStream and distribute the copies to other worker nodes?
– Xingjun Wang
Sorry, I'm not sure I understood you. A DStream represents a stream of RDDs, where each RDD within the DStream represents the data received during one batch. Say you have a DStream with a 1-minute interval and you run the stream for 1 hour. You will end up with 60 RDDs. Each RDD is distributed. Say each RDD has 2 partitions; you will then have 120 partitions in total. Each partition resides on one executor. In the case of network streams, the data is replicated.
– ayan guha
I see. I just wanted to know whether, when Spark Streaming receives data from the network, the replicated data is transferred to other nodes. Now I understand. Thank you :-)
– Xingjun Wang
1 Answer
As far as I know, data is received by the worker node where the receiver is running. It is not distributed across other nodes.
If you need the input stream to be repartitioned (balanced across the cluster) before further processing, you can use:
inputStream.repartition(<number of partitions>)
You can read more about the level of parallelism in the Spark documentation:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
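To make the repartition suggestion concrete, here is a minimal PySpark sketch, not a definitive setup. The helper names (blocks_per_batch, build_balanced_stream, main), the socket source on localhost:9999, the 2-second batch interval and the target of 8 partitions are all assumptions chosen for illustration. The estimate follows the rule of thumb from the tuning guide linked above: each receiver produces roughly (batch interval / block interval) blocks per batch, and spark.streaming.blockInterval defaults to 200 ms.

```python
def blocks_per_batch(batch_interval_ms, block_interval_ms=200, num_receivers=1):
    """Rough number of partitions in each batch RDD before repartitioning.

    Each receiver cuts one block per block interval, and each block
    becomes one partition of the batch RDD.
    """
    return num_receivers * (batch_interval_ms // block_interval_ms)


def build_balanced_stream(ssc, host, port, num_partitions):
    """Read a socket text stream and rebalance every batch.

    socketTextStream starts a single receiver on one executor;
    repartition shuffles each batch RDD into num_partitions partitions
    so later stages can use more of the cluster.
    """
    lines = ssc.socketTextStream(host, port)
    return lines.repartition(num_partitions)


def main():
    # Imported here so the helpers above can be read and tested without Spark.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[4]", "RepartitionSketch")  # assumed local master
    ssc = StreamingContext(sc, 2)                        # 2-second batches

    # Hypothetical source: a text server on localhost:9999.
    balanced = build_balanced_stream(ssc, "localhost", 9999, num_partitions=8)
    balanced.count().pprint()  # print the record count of every batch

    ssc.start()
    ssc.awaitTermination()
```

Calling main() starts the job. With one receiver, 2-second batches and the default block interval, blocks_per_batch(2000) gives 10, so each batch starts with about 10 partitions on the receiving side before the repartition. Note that repartition triggers a shuffle, so it trades network traffic for better balance.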