3 votes

I know that Apache Spark's persist method saves RDDs in memory and that, with a storage level such as MEMORY_AND_DISK, if there is not enough memory it stores the remaining partitions of the RDD on the filesystem (disk). What I can't seem to understand is the following:

Imagine we have a cluster and we want to persist an RDD. Suppose node A does not have much free memory and node B does. Now suppose that after running the persist command, node A runs out of memory. The question is:

Does Apache Spark look for free memory on node B and try to store everything in memory?

Or, given that there is not enough space on node A, does Spark store the remaining partitions on the disk of node A even if there is some memory available on node B?

Thanks for your answers.


2 Answers

2 votes

Normally Spark doesn't search for free space; data is cached locally on the executor responsible for a particular partition.

The only exception is when you use a replicated storage level; in that case an additional copy of each partition is placed on another node.
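For example, a minimal sketch in Scala (the RDD and its contents are made up for illustration) of a replicated storage level, where each cached partition gets a second copy on another executor:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("persist-demo").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical RDD used only for illustration.
    val numbers = sc.parallelize(1 to 1000000)

    // MEMORY_ONLY_2 keeps each partition in memory on its executor
    // and replicates it to one additional executor.
    numbers.persist(StorageLevel.MEMORY_ONLY_2)
    numbers.count()  // an action is needed to actually materialize the cache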

1 vote

The closest thing I could find is this: To cache or not to cache. I have had plenty of situations where the data was mildly skewed and I was getting memory-related exceptions/failures when trying to cache/persist into RAM. One way around it was to use StorageLevels like MEMORY_AND_DISK, but obviously it took longer to cache and then read those partitions.
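A minimal sketch of that workaround (assuming a spark-shell session where sc is available, and a hypothetical skewed RDD named events):

    import org.apache.spark.storage.StorageLevel

    // Hypothetical skewed dataset, for illustration only.
    val events = sc.parallelize(Seq.fill(1000000)("event")).map(e => (e, 1))

    // MEMORY_AND_DISK keeps partitions in memory where they fit and spills
    // the remainder to the local disk of the same executor.
    events.persist(StorageLevel.MEMORY_AND_DISK)
    events.count()  // spilled partitions are later read back from disk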

Also, in the Spark UI (the Executors and Storage tabs) you can find information about executors and how much of their memory is used for caching; you can experiment and monitor how it behaves.
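Beyond the UI, here is a rough programmatic sketch using SparkContext's developer APIs (run in spark-shell, where sc is predefined) to inspect per-executor cache memory and the storage of persisted RDDs:

    // Max and remaining memory available for caching, per executor
    // (getExecutorMemoryStatus returns a Map[address, (maxMem, remainingMem)]).
    sc.getExecutorMemoryStatus.foreach { case (executor, (maxMem, remaining)) =>
      println(f"$executor: max=${maxMem / 1e6}%.0f MB, free=${remaining / 1e6}%.0f MB")
    }

    // Memory vs. disk usage of each persisted RDD.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"RDD ${info.id}: memory=${info.memSize} bytes, disk=${info.diskSize} bytes")
    }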