2
votes

I'm running Spark 2.0 in standalone mode, and I'm the only one submitting jobs in my cluster.

Suppose I have an RDD with 100 partitions and only 10 partitions in total would fit in memory at a time.

Let's also assume that the allotted execution memory is sufficient and will not interfere with storage memory.

Suppose I iterate over the data in that RDD.

rdd.persist()  // MEMORY_ONLY

for (_ <- 0 until 10) {
  rdd.map(...).reduce(...)
}

rdd.unpersist()

For each iteration, will the first 10 partitions that are persisted always be in memory until rdd.unpersist()?
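
In case it helps, this is roughly how I planned to check what actually stays cached (sc is my SparkContext; getRDDStorageInfo is a developer API, so treat this as a sketch rather than a guaranteed interface):

sc.getRDDStorageInfo.foreach { info =>
  // RDDInfo reports how many partitions of each persisted RDD are cached and their size
  println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} " +
    s"partitions cached, ${info.memSize} bytes in memory")
}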


2 Answers

4
votes

As far as I know, Spark uses an LRU (Least Recently Used) eviction strategy for RDD partitions by default. Work on adding new strategies is tracked here: https://issues.apache.org/jira/browse/SPARK-14289

This strategy evicts the element that was least recently used. The last-used timestamp is updated when an element is put into the cache or when an element is retrieved from it.
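
As a toy illustration of those semantics (this is plain java.util.LinkedHashMap with access ordering, not Spark code; the capacity of 10 is just the number from the question):

import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

val capacity = 10
// accessOrder = true: both put and get move an entry to the "most recently used" end
val lru = new JLinkedHashMap[Int, String](16, 0.75f, true) {
  override def removeEldestEntry(eldest: JMap.Entry[Int, String]): Boolean =
    size() > capacity  // drop the least recently used entry once over capacity
}

(0 until 12).foreach(i => lru.put(i, s"partition-$i"))  // entries 0 and 1 get evicted
lru.get(5)  // reading 5 refreshes it, so it is now the last candidate for eviction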

I suppose you will always have 10 partitions in memory, but which ones stay cached and which ones get evicted depends on how they are used. According to the Apache Spark FAQ:

Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.

Thus, it depends on your configuration whether the other partitions are spilled to disk or recomputed on the fly. Recomputation is the default, which is not always the most efficient option. You can set the dataset's storage level to MEMORY_AND_DISK to avoid this.
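
For example:

import org.apache.spark.storage.StorageLevel

// Partitions that do not fit in memory are spilled to disk instead of being recomputed
rdd.persist(StorageLevel.MEMORY_AND_DISK)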

1
vote

I think I found the answer, so I'm going to answer my own question.

The eviction policy seems to be in the MemoryStore class. Here's the source code.

It seems that entries are not evicted to make room for entries of the same RDD.
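
Here is a simplified paraphrase of that check as I read it (not the verbatim Spark source): when space is needed for a block of some RDD, a cached block is only a candidate for eviction if it belongs to a different RDD.

// rddToAdd: id of the RDD whose block we are trying to store (None for non-RDD blocks)
// candidateRddId: id of the RDD that owns the cached block we consider evicting
def isEvictable(candidateRddId: Option[Int], rddToAdd: Option[Int]): Boolean =
  rddToAdd.isEmpty || rddToAdd != candidateRddId

isEvictable(candidateRddId = Some(7), rddToAdd = Some(7))  // false: same RDD, keep it
isEvictable(candidateRddId = Some(3), rddToAdd = Some(7))  // true: may be evicted

So, as far as I can tell, iterating over rdd should not cause its own cached partitions to evict each other; once the store is full, the remaining partitions simply fail to cache under MEMORY_ONLY and are recomputed when needed.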