3 votes

I know that Apache Spark's persist method saves RDDs in memory and that, with a storage level such as MEMORY_AND_DISK, if there is not enough memory it stores the remaining partitions of the RDD on the filesystem (disk). What I can't seem to understand is the following:

Imagine we have a cluster and we want to persist an RDD. Suppose node A does not have much free memory and node B does. Now suppose that after running the persist command, node A runs out of memory. The question is:

Does Apache Spark look for free memory on node B and try to store everything in memory?

Or, given that there is not enough space on node A, does Spark store the remaining partitions on the disk of node A even if there is some memory available on node B?

Thanks for your answers.


2 Answers

2 votes

Normally Spark doesn't search for free space; data is cached locally on the executor responsible for a particular partition.

The only exception is when you use a replicated storage level; in that case an additional copy of each partition is placed on another node.
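For example, a minimal sketch in Scala (the RDD and its contents are made up for illustration) of a replicated storage level, where each cached partition gets a second copy on another executor:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("persist-demo").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical RDD used only for illustration.
    val numbers = sc.parallelize(1 to 1000000)

    // MEMORY_ONLY_2 keeps each partition in memory on its executor
    // and replicates it to one additional executor.
    numbers.persist(StorageLevel.MEMORY_ONLY_2)
    numbers.count()  // an action is needed to actually materialize the cache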

1 vote

The closest thing I could find is this: To cache or not to cache. I have had plenty of situations where the data was mildly skewed and I was getting memory-related exceptions/failures when trying to cache/persist into RAM. One way around it was to use StorageLevels like MEMORY_AND_DISK, but obviously it took longer to cache and then read those partitions.
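A minimal sketch of that workaround (assuming a spark-shell session where sc is available, and a hypothetical skewed RDD named events):

    import org.apache.spark.storage.StorageLevel

    // Hypothetical skewed dataset, for illustration only.
    val events = sc.parallelize(Seq.fill(1000000)("event")).map(e => (e, 1))

    // MEMORY_AND_DISK keeps partitions in memory where they fit and spills
    // the remainder to the local disk of the same executor.
    events.persist(StorageLevel.MEMORY_AND_DISK)
    events.count()  // spilled partitions are later read back from disk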

Also, in the Spark UI (the Executors and Storage tabs) you can find information about executors and how much of their memory is used for caching; you can experiment and monitor how it behaves.
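Beyond the UI, here is a rough programmatic sketch using SparkContext's developer APIs (run in spark-shell, where sc is predefined) to inspect per-executor cache memory and the storage of persisted RDDs:

    // Max and remaining memory available for caching, per executor
    // (getExecutorMemoryStatus returns a Map[address, (maxMem, remainingMem)]).
    sc.getExecutorMemoryStatus.foreach { case (executor, (maxMem, remaining)) =>
      println(f"$executor: max=${maxMem / 1e6}%.0f MB, free=${remaining / 1e6}%.0f MB")
    }

    // Memory vs. disk usage of each persisted RDD.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"RDD ${info.id}: memory=${info.memSize} bytes, disk=${info.diskSize} bytes")
    }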