I think the question would be better formulated as:
When do we need to call cache or persist on an RDD?
Spark processes are lazy, that is, nothing will happen until it's required.
To answer the question quickly: after val textFile = sc.textFile("/user/emp.txt") is issued, nothing happens to the data. Only a HadoopRDD is constructed, using the file as its source.
Let's say we transform that data a bit:
val wordsRDD = textFile.flatMap(line => line.split("\\W"))
Again, nothing happens to the data. Now there's a new RDD, wordsRDD, that contains a reference to textFile and a function to be applied when needed.
Only when an action is called on an RDD, like wordsRDD.count, will the chain of RDDs, called the lineage, be executed. That is, the data, broken down into partitions, will be loaded by the Spark cluster's executors, the flatMap function will be applied and the result will be calculated.
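To see this laziness for yourself, here is a minimal sketch. It assumes spark-core is on the classpath and uses a local master, a couple of in-memory sample lines instead of /user/emp.txt, and an accumulator I named splitCalls to count how often the flatMap function actually runs. The object and method names are mine, purely for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns (flatMap calls before the action, flatMap calls after it, word count).
  def run(sc: SparkContext): (Long, Long, Long) = {
    // Counts invocations of the flatMap function (illustrative name).
    val splitCalls = sc.longAccumulator("splitCalls")

    // Sample data standing in for the text file of the answer.
    val lines = sc.parallelize(Seq("hello spark", "lazy evaluation"), 2)

    // Transformation only: this records the function in the lineage, nothing runs yet.
    val wordsRDD = lines.flatMap { line =>
      splitCalls.add(1)
      line.split("\\W")
    }
    val callsBefore = splitCalls.value.longValue()

    // Action: now the lineage is executed on the partitions.
    val wordCount = wordsRDD.count()
    val callsAfter = splitCalls.value.longValue()

    (callsBefore, callsAfter, wordCount)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-evaluation-sketch")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

On a plain local run without task retries, the accumulator should still be at zero right after flatMap is declared and only move once count is called.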
On a linear lineage, like the one in this example, cache() is not needed. The data will be loaded to the executors, all the transformations will be applied and finally the count will be computed, all in memory, provided the data fits in memory.
cache is useful when the lineage of the RDD branches out. Let's say you want to split the words of the previous example into a count of positive words and a count of negative words. You could do it like this:
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
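Here, each branch triggers a reload of the data: every count is a separate action that re-executes the whole lineage, so the file is read and the flatMap is applied twice. Adding an explicit cache statement ensures that the processing done previously is preserved and reused. Note that cache itself is lazy too: the data is stored when the first action runs. The job will look like this: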
val textFile = sc.textFile("/user/emp.txt")
val wordsRDD = textFile.flatMap(line => line.split("\\W"))
wordsRDD.cache()
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
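If you want to check the effect rather than take my word for it, the following sketch runs the two branches with and without cache() and counts how often the flatMap function is invoked. Assumptions: spark-core on the classpath, a local master, in-memory sample lines, and toy word sets standing in for isPositive and isNegative, which the example above leaves undefined. All names here are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheBranchSketch {
  // Toy stand-ins for the isPositive / isNegative helpers (assumed, not from any library).
  private val positiveWords = Set("good", "great")
  private val negativeWords = Set("bad", "awful")
  def isPositive(word: String): Boolean = positiveWords.contains(word)
  def isNegative(word: String): Boolean = negativeWords.contains(word)

  // Returns (positive count, negative count, number of flatMap invocations).
  def run(sc: SparkContext, useCache: Boolean): (Long, Long, Long) = {
    val flatMapCalls = sc.longAccumulator("flatMapCalls")
    val lines = sc.parallelize(Seq("good bad great", "awful good"), 2)

    val wordsRDD = lines.flatMap { line =>
      flatMapCalls.add(1)
      line.split("\\W")
    }
    // Marks the RDD for caching; it is materialized by the first action below.
    if (useCache) wordsRDD.cache()

    // Two actions on two branches of the same lineage.
    val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
    val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()

    val calls = flatMapCalls.value.longValue()
    if (useCache) wordsRDD.unpersist()
    (positiveWordsCount, negativeWordsCount, calls)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("cache-branch-sketch")
    val sc = new SparkContext(conf)
    try {
      println(s"without cache: ${run(sc, useCache = false)}")
      println(s"with cache:    ${run(sc, useCache = true)}")
    } finally sc.stop()
  }
}
```

With two input lines and no task retries, the uncached run should invoke the flatMap function once per line per action, while the cached run invokes it once per line in total. The counts themselves are identical either way, which is the point: cache changes cost, not results.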
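For that reason, cache is sometimes said to 'break the lineage', as it creates a reuse point from which further processing can continue. Strictly speaking, the lineage is kept (Spark uses it to recompute cached partitions that are lost or evicted); only checkpoint() actually truncates it.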
Rule of thumb: Use cache when the lineage of your RDD branches out or when an RDD is used multiple times, like in a loop.
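For the loop case, a rough sketch could look like the one below. It uses persist with an explicit storage level (for RDDs, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)) and releases the data with unpersist once the loop is done. The sample data, the length thresholds and all names are made up for illustration, and it assumes spark-core on the classpath with a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LoopReuseSketch {
  // For each threshold, counts the words that are at least that long.
  def countsByMinLength(sc: SparkContext, thresholds: Seq[Int]): Map[Int, Long] = {
    val wordsRDD = sc.parallelize(Seq("a bb ccc", "dddd bb"), 2)
      .flatMap(line => line.split("\\W"))
      // Kept in memory, spilling to disk if it does not fit.
      .persist(StorageLevel.MEMORY_AND_DISK)

    try {
      // Every iteration is a separate action that reuses the persisted words.
      thresholds.map { minLength =>
        minLength -> wordsRDD.filter(word => word.length >= minLength).count()
      }.toMap
    } finally {
      // Release the cached partitions once the RDD is no longer needed.
      wordsRDD.unpersist()
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("loop-reuse-sketch")
    val sc = new SparkContext(conf)
    try println(countsByMinLength(sc, Seq(1, 2, 3, 4))) finally sc.stop()
  }
}
```

Whether MEMORY_AND_DISK or the default MEMORY_ONLY is the better choice depends on how expensive the lineage is to recompute compared to reading the partitions back from disk.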