unless one performs an action on one's RDD after caching it, caching will not really happen.
This is 100% true. The methods cache/persist just mark the RDD for caching. The items inside the RDD are actually cached the first time an action computes the RDD.
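To see this for yourself, here is a minimal sketch, assuming a local Spark setup. The object name `LazyCacheSketch` and the accumulator `computed` are my own illustration, not something from the question. The accumulator counts how many elements have actually been computed at each point.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyCacheSketch {
  // Returns how many elements had been computed right after cache(),
  // after the first action, and after the second action.
  def run(): (Long, Long, Long) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-cache-sketch")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical counter: incremented once per element that is computed.
      val computed = sc.longAccumulator("computed")
      val rdd = sc.parallelize(1 to 10).map { x => computed.add(1); x }

      rdd.cache()                    // only marks the RDD, nothing runs yet
      val afterCache = computed.sum

      rdd.count()                    // first action: computes and caches the elements
      val afterFirstAction = computed.sum

      rdd.count()                    // second action: served from the cache
      val afterSecondAction = computed.sum

      (afterCache, afterFirstAction, afterSecondAction)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (afterCache, afterFirstAction, afterSecondAction) = run()
    println(s"after cache(): $afterCache")
    println(s"after first action: $afterFirstAction")
    println(s"after second action: $afterSecondAction")
  }
}
```

On a small local run like this I would expect the counter to stay at 0 after cache(), reach 10 after the first count(), and stay at 10 after the second one. Keep in mind that accumulator updates made inside a transformation can be applied more than once if a task is retried, so treat this as a demonstration rather than an exact measuring tool.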
...only calling count() after cache() to force persistence in the simple example he is giving. It is not necessary to do this every time one calls cache() or persist() in one's code. Is this right?
You are 100% right again. But I'll elaborate on this a bit.
For easy understanding, consider the example below.
rdd.cache()
val result = rdd.map(...).flatMap(...) //and so on
result.count() //or any other action
Assume you have 10 documents in your RDD. When the action in the above snippet runs, each document goes through these steps in turn, within the same job:
- cached
- map function
- flatMap function
On the other hand,
rdd.cache().count()
val result = rdd.map(...).flatMap(...) //and so on
result.count() //or any other action
When the above snippet is run, all 10 documents are cached first (the whole RDD), because count() is itself an action. The map function and the flatMap function are then applied to the cached data.
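The difference between the two orderings can be sketched as follows, again assuming a local Spark setup. The names `CacheOrderSketch`, `Snapshot`, `loaded` and `mapped` are hypothetical, and the flatMap body is just a placeholder. The two accumulators record how many documents have been loaded into the base RDD and how many have passed through map, both before and after the final action.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheOrderSketch {
  // Hypothetical snapshot of the two counters at one point in time.
  final case class Snapshot(loaded: Long, mapped: Long)

  // eager = true  mirrors rdd.cache().count() followed by the pipeline;
  // eager = false mirrors rdd.cache() followed directly by the pipeline.
  def run(sc: SparkContext, eager: Boolean): (Snapshot, Snapshot) = {
    val loaded = sc.longAccumulator("loaded")
    val mapped = sc.longAccumulator("mapped")

    val rdd = sc.parallelize(1 to 10).map { doc => loaded.add(1); doc }
    rdd.cache()
    if (eager) {
      rdd.count() // forces the whole RDD into the cache right now
    }
    val beforePipeline = Snapshot(loaded.sum, mapped.sum)

    // Placeholder pipeline standing in for map(...).flatMap(...).
    rdd.map { doc => mapped.add(1); doc }
      .flatMap(doc => Seq(doc, doc))
      .count()
    val afterPipeline = Snapshot(loaded.sum, mapped.sum)

    (beforePipeline, afterPipeline)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("cache-order-sketch")
    val sc = new SparkContext(conf)
    try {
      println(s"cache() only:    ${run(sc, eager = false)}")
      println(s"cache().count(): ${run(sc, eager = true)}")
    } finally {
      sc.stop()
    }
  }
}
```

With cache() only, I would expect both counters to be 0 before the pipeline's action, since nothing has run yet. With cache().count(), `loaded` should already be 10 while `mapped` is still 0. In both cases the counters should end at 10 each, so the final result is the same and only the moment at which the caching work happens differs.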
Both approaches are correct; which one to use depends on your requirements.
Hope this makes things clearer.