
If you cache a very large dataset that cannot all be stored in memory or on disk, how does Spark handle the partial cache? How does it know which data needs to be recomputed when you go to use that DataFrame again?

Example:

  1. Read a 100 GB dataset into memory as df1
  2. Compute a new DataFrame df2 based on df1
  3. Cache df2

If Spark can only fit 50 GB of cache for df2, what happens when you go to reuse df2 in the next steps? How would Spark know which data it doesn't need to recompute and which it does? Will it need to re-read the data that it couldn't persist?
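For concreteness, here is a minimal Scala sketch of the steps above (the path, column names and transformation are made up for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder()
      .appName("partial-cache-example")
      .getOrCreate()

    // 1. Read the ~100 GB dataset (hypothetical path)
    val df1 = spark.read.parquet("/data/big_dataset")

    // 2. Compute a new DataFrame df2 based on df1 (hypothetical transformation)
    val df2 = df1
      .filter(col("status") === "active")
      .withColumn("amount_usd", col("amount") * col("fx_rate"))

    // 3. Cache df2; nothing is materialized until an action runs
    df2.cache()
    df2.count()

    // Later reuse of df2 -- the question is what happens here when only
    // ~50 GB of df2 actually fit in the cache
    df2.groupBy("country").agg(sum("amount_usd")).show()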

UPDATE

What happens if you have 5 GB of memory and 5 GB of disk and try to cache a 20 GB dataset? What happens to the other 10 GB of data that can't be cached, and how does Spark know which data it needs to recompute and which it doesn't?

Spark uses partitions, so it is hard to answer your question. Maybe add your comments to the question? – thebluephantom

1 Answer


Spark has this default storage level for a DataFrame (DF) or Dataset (DS):

MEMORY_AND_DISK – This is the default behavior for a DataFrame or Dataset. With this storage level, the DataFrame is stored in JVM memory as deserialized objects. When the required storage is greater than the available memory, the excess partitions are spilled to local disk and read back from local disk when needed. This is slower, as there is I/O involved.
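For reference, .cache() on a DataFrame/Dataset is just shorthand for persisting with that level. A small sketch, using df2 from the question, of setting the level explicitly and checking what was applied:

    import org.apache.spark.storage.StorageLevel

    // .cache() on a DataFrame/Dataset is equivalent to:
    df2.persist(StorageLevel.MEMORY_AND_DISK)

    // Other levels can be passed explicitly if the deserialized default
    // takes too much memory, e.g.:
    //   df2.persist(StorageLevel.MEMORY_AND_DISK_SER)
    //   df2.persist(StorageLevel.DISK_ONLY)

    // Check which storage level is actually set on the Dataset
    println(df2.storageLevel)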

However, to be more specific:

Spark's unit of processing is a partition = 1 task. So the discussion is really about whether a partition (or partitions) fits into memory and/or local disk.

If a partition of the DF doesn't fit in memory and on disk when using StorageLevel.MEMORY_AND_DISK, then the OS will fail, i.e. kill, the Executor / Worker. Eviction of partitions belonging to other DFs may occur, but not of partitions belonging to your own DF. The .cache is either successful or not; there is no re-reading in this case.

I base this on the fact that partition eviction does not occur for partitions belonging to the same underlying RDD. None of this is well explained, but see here: How does Spark evict cached partitions?. Partitions of other RDDs may be evicted and re-computed, but in the end you also need enough local disk as well as memory.
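If you want to see the partition-level bookkeeping yourself, the driver exposes the same numbers as the Spark UI's Storage tab; a rough sketch (getRDDStorageInfo is a developer API, so details may vary by version):

    // After df2.cache() and an action, ask the driver what was materialized.
    // numCachedPartitions vs. numPartitions shows how much of the cached
    // plan is actually held in memory / on local disk.
    spark.sparkContext.getRDDStorageInfo.foreach { info =>
      println(s"RDD ${info.id} '${info.name}': " +
        s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
        s"memory=${info.memSize} bytes, disk=${info.diskSize} bytes")
    }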

A good read is: https://sparkbyexamples.com/spark/spark-difference-between-cache-and-persist/