In spark streaming, the received data is replicated among multiple Spark executors in worker nodes in the cluster (default replication factor is 2)(http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html). But how can I get the location of the replication of an specific RDD?
0
votes
1 Answers
0
votes
In Spark UI there is a tab called "Storage" that tell you which RDDs are cached and where (memory, disk, serialized, etc).
For Spark Streaming by default it will serialize the RDD in memory and remove old ones as needed. If you don't have computations that depend on previous results it's better if you set spark.streaming.unpersist to True, so once processed get's removed to avoid putting pressure on the garbage collector.