For starters, eager persistence would pollute the whole pipeline. cache or persist only expresses intention. It doesn't mean we'll ever get to the point where the RDD is materialized and could actually be cached. Moreover, there are contexts where data is cached automatically, for example intermediate shuffle data.
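A minimal sketch of this laziness (assuming a local Spark session; the RDD and names here are illustrative, not taken from the question): marking an RDD with cache only records the intent, and nothing is computed or stored until an action runs.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cache-is-lazy")   // illustrative app name
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Build a pipeline and mark an intermediate RDD for caching.
val numbers = sc.parallelize(1 to 1000000)
val squares = numbers.map(x => x.toLong * x).cache() // only a hint, no job runs here

// Nothing has been materialized yet; cache() merely recorded the
// requested storage level (MEMORY_ONLY) on the RDD.
println(squares.getStorageLevel)

// The RDD is materialized, and only then actually cached, when an
// action finally forces evaluation.
println(squares.count())

spark.stop()
```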
As for the premise that either way, eager or lazy, the entire RDD will be persisted according to its storage level: it is not exactly true. The thing is, persist is not persistent. As clearly stated in the documentation for the MEMORY_ONLY persistence level:
If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.
With MEMORY_AND_DISK the remaining data is written to disk, but it can still be evicted if there is not enough memory for subsequent caching. What is even more important:
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion.
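As a sketch (reusing the SparkContext from the example above; the dataset is made up), persist takes an explicit storage level, while for RDDs cache is just persist(StorageLevel.MEMORY_ONLY); in either case the call itself still runs nothing.

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical RDD, reusing the SparkContext `sc` from the sketch above.
val events = sc.parallelize(1 to 100).map(i => s"event-$i")

// MEMORY_AND_DISK: partitions that do not fit in memory are spilled to
// disk instead of being recomputed, but cached blocks can still be
// dropped later under the LRU policy quoted above.
events.persist(StorageLevel.MEMORY_AND_DISK)

// As before, this only records the intention; the blocks are actually
// written when an action runs a job over the RDD.
events.count()
```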
You can also argue that cache / persist is semantically different from Spark actions, which are executed for their specific IO side effects. cache is more of a hint to the Spark engine that we may want to reuse this RDD later.
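To illustrate the contrast (again assuming the SparkContext from the earlier sketches; the output path is a placeholder): an action such as saveAsTextFile is called for its IO side effect and submits a job immediately, whereas cache on its own submits nothing.

```scala
val words = sc.parallelize(Seq("spark", "cache", "persist"))

words.cache()                            // a hint: no job is submitted
words.saveAsTextFile("/tmp/words-out")   // an action: a job runs now (placeholder path)
```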