If you cache a very large dataset that cannot be fully stored in memory or on disk, how does Spark handle the partial cache? How does it know which data needs to be recomputed when you go to use that DataFrame again?
Example:
- Read a 100 GB dataset into `df1`
- Compute a new DataFrame `df2` based on `df1`
- Cache `df2`

If Spark can only fit 50 GB of cache for `df2`, what happens when you go to reuse `df2` in the next steps? How would Spark know which data it doesn't need to recompute and which it does? Will it need to re-read the data that it couldn't persist?
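For concreteness, here is a minimal PySpark sketch of the scenario I mean; the path, column names, and transformation are only illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partial-cache-example").getOrCreate()

# Read the ~100 GB source dataset (path is hypothetical)
df1 = spark.read.parquet("/data/events_100gb.parquet")

# Derive a second DataFrame from df1 (transformation is just an example)
df2 = (df1
       .filter(F.col("status") == "active")
       .withColumn("day", F.to_date("timestamp")))

# Ask Spark to cache df2 (DataFrame cache() uses a memory-and-disk storage level)
df2.cache()

# First action materializes df2 and caches as many partitions as fit
df2.count()

# Later reuse: which partitions come from the cache, and which get recomputed?
df2.groupBy("day").count().show()
```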
UPDATE
What happens if you have 5 GB of memory and 5 GB of disk available and try to cache a 20 GB dataset? What happens to the other 10 GB of data that can't be cached, and how does Spark know which data it needs to recompute and which it doesn't?
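Again as a sketch, this is the kind of call I'm asking about, with an explicit storage level; the dataset path and the 5 GB / 5 GB sizing are only assumptions about the cluster:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partial-cache-update").getOrCreate()

# Hypothetical ~20 GB dataset; assume roughly 5 GB of executor memory
# and 5 GB of local disk are available for caching.
big_df = spark.read.parquet("/data/big_20gb.parquet")

# Explicitly request memory-plus-disk caching
big_df.persist(StorageLevel.MEMORY_AND_DISK)

# Materialize the cache with an action
big_df.count()

# On reuse: what happens to the ~10 GB of partitions that never fit anywhere?
big_df.select("user_id").distinct().count()
```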