RDD distribution in Spark Streaming

Question

In Spark Streaming, received data is replicated among multiple Spark executors on worker nodes in the cluster (the default replication factor is 2; see http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html). But how can I find the location of the replicas of a specific RDD?

What are you trying to accomplish? Maybe we can figure out another way. — Holden
I want to know how Spark achieves workload balance if the receiver nodes continuously receive data and replicate RDD blocks. — Xingjun Wang

Gonzalo Herreros · Accepted Answer · 2015-11-20T10:02:21

In the Spark UI there is a tab called "Storage" that tells you which RDDs are cached and where (memory, disk, serialized, etc.).
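
If you want similar information from code rather than from the UI, something like the following Scala sketch may help. It assumes a local master and a hypothetical cached RDD named "numbers", both chosen only for illustration. Note that `getRDDStorageInfo` is marked as a developer API, so its behavior may vary between Spark versions, and it reports summary figures per RDD rather than per-executor block locations.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StorageInfoSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads, for illustration only.
    val conf = new SparkConf()
      .setAppName("storage-info-sketch") // hypothetical app name
      .setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Hypothetical RDD; caching only takes effect once an action runs.
    val numbers = sc.parallelize(1 to 1000).setName("numbers").cache()
    numbers.count()

    // Print a summary for each RDD that is currently persisted.
    sc.getRDDStorageInfo.foreach { info =>
      println(
        s"${info.name}: level=${info.storageLevel.description}, " +
        s"cached=${info.numCachedPartitions}/${info.numPartitions}, " +
        s"memBytes=${info.memSize}, diskBytes=${info.diskSize}")
    }

    sc.stop()
  }
}
```

In a streaming job the same call could be made inside a batch operation, but the "Storage" tab remains the simpler way to see which executors hold the blocks.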

For Spark Streaming, by default the RDDs are kept serialized in memory and old ones are removed as needed. If you don't have computations that depend on previous results, it's better to set spark.streaming.unpersist to true, so that each RDD is removed once it has been processed, which avoids putting pressure on the garbage collector.
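
As a rough sketch, the property can be set on the `SparkConf` before the streaming context is created. The app name, master, and batch interval below are assumptions for illustration, not values from the question.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UnpersistSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("unpersist-sketch") // hypothetical app name
      .setMaster("local[2]")          // assumption: local mode for testing
      // Let Spark Streaming drop generated RDDs once they are no longer needed.
      .set("spark.streaming.unpersist", "true")

    // Assumption: 5-second batches; pick an interval that suits the workload.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Confirm the setting that the context will actually use.
    println(ssc.sparkContext.getConf.get("spark.streaming.unpersist"))

    // No streams are defined in this sketch, so just shut down cleanly.
    ssc.stop()
  }
}
```

The same property can also be passed on the command line with `spark-submit --conf spark.streaming.unpersist=true`.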
