
When I run my Spark application, several jobs are spawned, and each job has several stages.

I'm experimenting with persisting RDDs. I'm persisting an RDD to disk, but there is no way I can tell whether it is actually being reused across jobs.

When I look at the DAG, I do see a green dot signifying that an RDD is persisted, but I also see the previous map/filter etc. in the DAG.

For example, in the Job-0 DAG I see:

RandomRDD [0] -> MapPartitionsRDD [1] -> MapPartitionsRDD [2] (green) -> Filter [3]...

And then in the Job-1 DAG I also see:

RandomRDD [0] -> MapPartitionsRDD [1] -> MapPartitionsRDD [2] (green) -> Filter [3]...

How can I tell whether RDD [0], RDD [1] and RDD [2] were recalculated or simply rehydrated (read back from the persisted copy on disk)?

In general, by looking at the job history, how can you tell whether an RDD was recalculated or simply rehydrated?

In your log file on the worker nodes, search for something like: < BlockManager: Found block rdd_9_1 locally >. The CacheManager in Spark is responsible for passing an RDD partition's contents to the BlockManager and for making sure a node doesn't load two copies of an RDD at once. If instead you see something like: < CacheManager: Partition rdd_9_1 not found, computing it >, then the partition was not available in the cache: when an RDD is larger than the free memory, partitions are either spilled to disk (for disk-backed storage levels) or dropped and recalculated when needed. jaceklaskowski.gitbooks.io/mastering-apache-spark/content/…blockR
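To provoke those messages in practice, here is a minimal sketch (Scala, run in spark-shell where sc already exists; the toy pipeline and numbers are made up for illustration):

    import org.apache.spark.storage.StorageLevel

    // A toy pipeline; persist the last transformation to disk (the "green dot" RDD).
    val rdd = sc.parallelize(1 to 1000000)
      .map(_ * 2)
      .persist(StorageLevel.DISK_ONLY)

    rdd.count() // first action: partitions are computed and written to disk
    rdd.count() // second action: executor logs should now show a
                // "Found block rdd_N_M ..." message rather than
                // "Partition rdd_N_M not found, computing it"

    // Confirm the RDD is registered as persisted with the driver:
    sc.getPersistentRDDs.foreach { case (id, r) =>
      println(s"RDD $id persisted at level ${r.getStorageLevel.description}")
    }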

1 Answer


Once RDD [2] is persisted, the calculations necessary to produce it and its upstream RDDs ([0] and [1]) will not be repeated: Spark reads the persisted partitions back instead of recomputing the lineage. To test it, run a simple action on RDD [2] before persisting it and again after persisting it, and notice the difference in time.
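A rough way to do that timing test in spark-shell (a sketch; the expensive map below is just a stand-in for real work):

    import org.apache.spark.storage.StorageLevel

    def time[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
      result
    }

    // Stand-in for an expensive transformation.
    val rdd2 = sc.parallelize(1 to 100000).map { x =>
      (1 to 10000).foldLeft(x.toLong)(_ + _)
    }

    time("before persist")(rdd2.count())               // full recomputation
    rdd2.persist(StorageLevel.DISK_ONLY)
    time("first action after persist")(rdd2.count())   // computes once, writes blocks to disk
    time("second action after persist")(rdd2.count())  // reads blocks back from disk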

Spark Persistence Documentation

One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.
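For instance, a dataset derived from the persisted RDD reuses the cached partitions too, which matches the two DAGs in the question (a sketch; the names are illustrative):

    import org.apache.spark.storage.StorageLevel

    val base = sc.parallelize(1 to 1000000).map(_ * 2).persist(StorageLevel.DISK_ONLY)
    base.count()                 // Job 0: computes [0] -> [1] -> [2] and persists [2]

    val filtered = base.filter(_ % 4 == 0)
    filtered.count()             // Job 1: the DAG still draws the full lineage,
                                 // but tasks read [2]'s blocks from disk instead
                                 // of recomputing [0] and [1]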