Several sources describe RDDs as ephemeral by default (e.g., this s/o answer) -- meaning that they do not stay in memory unless we call cache() or persist() on them.

So let's say our program involves an ephemeral (not explicitly cached by the user) RDD that is used in a few operations that cause the RDD to materialize. My question is: does Spark discard the materialized ephemeral RDD immediately -- or is it possible that the RDD stays in memory for other operations, even if we never asked for it to be cached?

Also, if an ephemeral RDD stays in memory, is it always only because some LRU policy has not yet kicked it out -- or can it also be because of scheduling optimizations?

I've tried to figure this out with the code below -- run in a Jupyter notebook with Python 3.5 and Spark 1.6.0, on a 4-core machine -- but I would appreciate an answer from someone who knows for sure.

import pyspark
sc = pyspark.SparkContext()
N = 1000000   # size of dataset
THRESHOLD = 100  # some constant

def f():
    """ do not cache """
    rdd = sc.parallelize(range(N))
    for i in range(10):
        print(rdd.filter(lambda x: x > i * THRESHOLD).count())

def g():
    """ cache """
    rdd = sc.parallelize(range(N)).cache()
    for i in range(10):
        print(rdd.filter(lambda x: x > i * THRESHOLD).count())

For the two functions above, f() does not ask the RDD to persist, but g() does, at the beginning. When I time the two functions, f() and g(), I get very comparable performance, as if the cache() call has made no difference. (In fact, the one that uses caching is slower.)

%%timeit
f()
> 1 loops, best of 3: 2.19 s per loop

%%timeit
g()
> 1 loops, best of 3: 2.7 s per loop

Actually, even modifying f() to call unpersist() on the RDD does not change things.

def ff():
    """ modified f() with explicit call to unpersist() """
    rdd = sc.parallelize(range(N))
    for i in range(10):
        rdd.unpersist()
        print(rdd.filter(lambda x: x > i * THRESHOLD).count())

%%timeit
ff()
> 1 loops, best of 3: 2.25 s per loop

The documentation for unpersist() states that it "mark[s] the RDD as non-persistent, and remove[s] all blocks for it from memory and disk." Is this really so, though - or does Spark ignore the call to unpersist when it knows it's going to use the RDD down the road?

1 Answer

There is simply no value in caching here. Creating an RDD from a range is extremely cheap (every partition needs only two integers to get going), and the action you apply cannot really benefit from caching. persist is applied to the Java object, not the Python one, and your code doesn't perform any work between RDD creation and the first transformation.

Even if you ignore all of that, this is a very simple task with tiny data. The total cost is most likely driven more by scheduling and communication than by anything else.
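
To make the "two integers per partition" point concrete, here is a plain-Python sketch (no Spark needed). The helper names partition_bounds and compute_partition are made up for illustration and this is not Spark's actual code; it only assumes that a range is split into contiguous slices, so each slice can be rebuilt from its start and end.

```python
def partition_bounds(n, num_slices):
    """Split range(n) into num_slices contiguous (start, end) pairs."""
    return [
        (split * n // num_slices, (split + 1) * n // num_slices)
        for split in range(num_slices)
    ]


def compute_partition(bounds):
    """Rebuild one partition lazily from nothing but its two integers."""
    start, end = bounds
    return range(start, end)


bounds = partition_bounds(1000000, 4)
print(bounds)

# "Recomputing" the whole dataset only means rebuilding four range objects,
# so there is almost nothing for a cache to save.
total = sum(len(compute_partition(b)) for b in bounds)
print(total)
```

With so little work needed to regenerate a partition, reading it back from a cache is unlikely to be any cheaper than recomputing it.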

If you want to see caching in action, consider the following example:

from pyspark import SparkContext
import time

def f(x):
    # Simulate an expensive computation: one second per element
    time.sleep(1)
    return x

sc = SparkContext("local[5]")
rdd = sc.parallelize(range(50), 5).map(f)
rdd.cache()

%time rdd.count()   # First run, no data cached ~10 s
## CPU times: user 16 ms, sys: 4 ms, total: 20 ms
## Wall time: 11.4 s
## 50

%time rdd.count()  # Second time, task results fetched from cache
## CPU times: user 12 ms, sys: 0 ns, total: 12 ms
## Wall time: 114 ms
## 50

rdd.unpersist()  # Data unpersisted

%time rdd.count()  #  Results recomputed ~10s
## CPU times: user 16 ms, sys: 0 ns, total: 16 ms 
## Wall time: 10.1 s
## 50

While in simple cases like this one persisting behavior is predictable, in general caching should be considered a hint, not a contract. Task output may or may not be persisted, depending on available resources, and can be evicted from the cache without any user intervention.
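
As a rough illustration of the "hint, not a contract" point, here is a toy model in plain Python. ToyBlockCache is a hypothetical class written for this answer, not a Spark API; it only assumes a fixed-capacity store that drops the least recently used block when it runs out of room, which is the policy the Spark programming guide describes for cached partitions.

```python
from collections import OrderedDict


class ToyBlockCache:
    """Hypothetical LRU store holding at most `capacity` partition blocks."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.recomputations = 0

    def get(self, block_id, compute):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return self.blocks[block_id]
        # Cache miss: the block was never stored or has been evicted.
        self.recomputations += 1
        value = compute(block_id)
        self.blocks[block_id] = value
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return value


def square(block_id):
    """Stand-in for the work needed to compute one partition."""
    return block_id * block_id


cache = ToyBlockCache(capacity=2)
for block_id in [0, 1, 0, 2, 1]:
    cache.get(block_id, square)

print(cache.recomputations)  # 4: block 1 was evicted and had to be recomputed
print(list(cache.blocks))    # [2, 1]
```

Block 1 was "cached" on first access but still had to be recomputed later, because the store ran out of room in between. No one called anything like unpersist() on it.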