I'm building a generic function which receives an RDD and runs some calculations on it. Since I run more than one calculation on the input RDD, I would like to cache it. For example:
public JavaRDD<String> foo(JavaRDD<String> r) {
    r.cache();
    JavaRDD<String> t1 = r... // some calculations
    JavaRDD<String> t2 = r... // other calculations
    return t1.union(t2);
}
My question is: since r is handed to me, it may or may not already be cached. If it is already cached and I call cache() on it again, will Spark create a new layer of cache, meaning that while t1 and t2 are computed there will be two instances of r in the cache? Or is Spark aware that r is already cached, so the second call will be ignored?
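For what it's worth, the only workaround I can think of is to guard the call by checking the storage level first. Here is a sketch of that idea (the map/filter bodies are just placeholders for my real calculations, and it assumes getStorageLevel() reflects an earlier cache()/persist() call on the same RDD):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

public JavaRDD<String> foo(JavaRDD<String> r) {
    // Cache only if no storage level has been set yet;
    // an unpersisted RDD reports StorageLevel.NONE().
    if (r.getStorageLevel().equals(StorageLevel.NONE())) {
        r.cache();
    }
    JavaRDD<String> t1 = r.map(String::toUpperCase);  // placeholder for "some calculations"
    JavaRDD<String> t2 = r.filter(s -> !s.isEmpty()); // placeholder for "other calculations"
    return t1.union(t2);
}

I'd still like to know whether this guard is actually necessary, or whether a second cache() call is a harmless no-op.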