
I'm building a generic function which receives an RDD and does some calculations on it. Since I run more than one calculation on the input RDD, I would like to cache it. For example:

public JavaRDD<String> foo(JavaRDD<String> r) {
    JavaRDD<String> t1 = r... // Some calculations
    JavaRDD<String> t2 = r... // Other calculations
    return t1.union(t2);
}

My question is: since r is given to me, it may or may not already be cached. If it is cached and I call cache on it again, will Spark create a new layer of cache, meaning that while t1 and t2 are calculated I will have two instances of r in the cache? Or is Spark aware that r is already cached, so that it ignores the second call?


2 Answers


Nothing. If you call cache on a cached RDD, nothing happens; the RDD will be cached (once). Caching, like transformations, is lazy:

  • When you call cache, the RDD's storageLevel is set to MEMORY_ONLY
  • When you call cache again, it's set to the same value (no change)
  • Upon evaluation, when the underlying RDD is materialized, Spark checks the RDD's storageLevel and caches the RDD if that level requires it.

So you're safe.
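The three steps above can be sketched with a toy model in plain Python. This is only an illustration of the described behaviour, not Spark's actual code: the class ToyRDD, its block store dictionary, and the string storage levels are all made up for the example.

```python
import itertools


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; NOT Spark's implementation)."""

    _ids = itertools.count()
    _block_store = {}  # RDD id -> materialized data, standing in for Spark's cache

    def __init__(self, compute):
        self.id = next(ToyRDD._ids)
        self.storage_level = "NONE"
        self._compute = compute
        self.compute_calls = 0  # how many times the data was really computed

    def cache(self):
        # Only records the desired storage level; nothing is computed or stored yet.
        self.storage_level = "MEMORY_ONLY"
        return self

    def collect(self):
        # Action: reuse the stored copy if there is one, otherwise compute.
        if self.id in ToyRDD._block_store:
            return ToyRDD._block_store[self.id]
        self.compute_calls += 1
        data = self._compute()
        if self.storage_level != "NONE":
            # Stored once, keyed by the RDD id.
            ToyRDD._block_store[self.id] = data
        return data


r = ToyRDD(lambda: [1, 2, 3])
r.cache()
r.cache()      # sets the same storage level again: no change
r.collect()    # first evaluation computes and stores the data
r.collect()    # second evaluation reads the stored copy
print(r.compute_calls, len(ToyRDD._block_store))  # prints: 1 1
```

In this model the second cache call cannot create a second copy, because cache only sets a flag and the stored data is keyed by the RDD's id.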


I just tested this on my cluster and Zohar is right: nothing happens, and the RDD is cached only once. The reason, I think, is that every RDD has an internal id, which Spark uses to track whether an RDD has been cached. So caching the same RDD multiple times does nothing.

Below is my code (added as requested):

### Cache and count; the storage info then shows up in the web UI
raw_file = sc.wholeTextFiles('hdfs://', minPartitions=40)
raw_file.cache().count()

### Cache and count again, then look at the web UI: nothing changes
raw_file.cache().count()

### Change the RDD's name, then cache and count again, to see whether it is cached a second time under the new name.
### Still nothing changes, so I think Spark may be using the RDD id as the marker. To be sure, we would need to
### read the documentation or even the source code.
raw_file.setName('raw_file_renamed')  # illustrative name, not the one used in the original run
raw_file.cache().count()
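The same check can be made from code instead of the web UI. The sketch below is a minimal helper, assuming a PySpark RDD; the function names cache_state and cache_twice are made up for this example, while id(), is_cached, cache() and getStorageLevel() are the RDD members it relies on.

```python
def cache_state(rdd):
    """Return (id, is_cached, storage level as text) for a PySpark RDD."""
    return rdd.id(), rdd.is_cached, str(rdd.getStorageLevel())


def cache_twice(rdd):
    """Call cache() twice and return the state observed after each call."""
    rdd.cache()
    after_first = cache_state(rdd)
    rdd.cache()  # second call on the same RDD
    after_second = cache_state(rdd)
    return after_first, after_second
```

In a PySpark shell, first, second = cache_twice(raw_file) should give two identical tuples if the second call really changes nothing. Note that is_cached only says a storage level was requested on that RDD object; it does not tell you whether the data has been materialized yet, so the Storage tab of the web UI is still the place to confirm that only one copy is held.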
