What happens if I cache the same RDD twice in Spark

Question

I'm building a generic function which receives a RDD and does some calculations on it. Since I run more than one calculation on the input RDD I would like to cache it. For example:

public JavaRDD<String> foo(JavaRDD<String> r) {
    r.cache();
    JavaRDD t1 = r... //Some calculations
    JavaRDD t2 = r... //Other calculations
    return t1.union(t2);
}

My question is, since r is given to me it may or may not already be cached. If it is cached and I call cache on it again, will spark create a new layer of cache meaning that while t1 and t2 are calculated I will have two instances of r in the cache? or will spark is aware of the fact that r is cached and will ignore it?

Tzach Zohar Tzach Zohar · Accepted Answer · 2016-03-24T07:51:53

Nothing. If you call cache on a cached RDD, nothing happens, RDD will be cached (once). Caching, like many other transformations, is lazy:

When you call cache, the RDD's storageLevel is set to MEMORY_ONLY
When you call cache again, it's set to the same value (no change)
Upon evaluation, when underlying RDD is materialized, Spark will check the RDD's storageLevel and if it requires caching, it will cache it.

So you're safe.

What happens if I cache the same RDD twice in Spark

2 Answers