In Spark Streaming, must I call count() after cache() or persist() to force caching/persistence to really happen?

Question

While watching this very good video on Spark internals, I heard the presenter say that unless one performs an action on one's RDD after caching it, caching will not really happen.

I never see count() being called in any other circumstances. So I'm guessing that he is only calling count() after cache() to force persistence in the simple example he is giving, and that it is not necessary to do this every time one calls cache() or persist() in one's code. Is this right?

code code · Accepted Answer · 2017-05-02T08:02:27

unless one performs an action on one's RDD after caching it, caching will not really happen.

This is 100% true. The cache() and persist() methods only mark the RDD for caching. Its partitions are actually computed and stored the first time an action runs on the RDD, or on an RDD derived from it.
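A minimal sketch of this behaviour, assuming a local Spark 2.x or later setup. The object name LazyCacheSketch, the accumulator name and the 10-element RDD are made up for illustration; the accumulator simply counts how often the map function runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyCacheSketch {
  // Returns the evaluation counter after cache(), after the first count(),
  // and after the second count().
  def run(): (Long, Long, Long) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-cache-sketch")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical counter: incremented every time the map function runs.
      val evaluations = sc.longAccumulator("evaluations")
      val rdd = sc.parallelize(1 to 10).map { x => evaluations.add(1); x * 2 }

      rdd.cache() // only marks the RDD for caching; no job runs here
      val afterCache: Long = evaluations.value

      rdd.count() // first action: computes the partitions and stores them
      val afterFirstCount: Long = evaluations.value

      rdd.count() // second action: reads the stored partitions
      val afterSecondCount: Long = evaluations.value

      (afterCache, afterFirstCount, afterSecondCount)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (afterCache, afterFirstCount, afterSecondCount) = run()
    println(s"after cache(): $afterCache")
    println(s"after first count(): $afterFirstCount")
    println(s"after second count(): $afterSecondCount")
  }
}
```

On a local run the counter should stay at 0 after cache(), reach 10 after the first count(), and stay at 10 after the second one, because the second action no longer re-runs the map function.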

...only calling count() after cache() to force persistence in the simple example he is giving. It is not necessary to do this every time one calls cache() or persist() in one's code. Is this right?

You are 100% right again. But I'll elaborate on this a bit.

To make this concrete, consider the example below.

rdd.cache()
val result = rdd.map(...).flatMap(...) //and so on
result.count() //or any other action

Assume the RDD holds 10 documents. When the snippet above runs, caching happens inside the same job as the action, so each document goes through these steps in turn:

cached
map function
flatMap function

On the other hand,

rdd.cache().count() //this action caches the whole RDD
val result = rdd.map(...).flatMap(...) //and so on
result.count() //or any other action

When the snippet above runs, all 10 documents (the whole RDD) are cached first, by the count() on the first line. Only then are the map and flatMap functions applied, reading from the cache.
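One case where the two patterns differ visibly is when the first downstream action touches only part of the data. The sketch below is an illustration under assumptions, not part of the original answer: the object name, the accumulator names, the 5-partition layout and the use of take(1) as the downstream action are all made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object EagerVsLazyCacheSketch {
  // Returns how many elements were computed (and so could be cached)
  // under each pattern: (cache() only, cache().count()).
  def run(): (Long, Long) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("eager-vs-lazy-cache")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical counters: one increment per element computed upstream.
      val lazyEvals = sc.longAccumulator("lazyEvals")
      val eagerEvals = sc.longAccumulator("eagerEvals")

      // Pattern 1: cache() only. The first downstream action fills the cache,
      // but only for the partitions that action actually visits.
      val lazyRdd = sc.parallelize(1 to 10, 5).map { x => lazyEvals.add(1); x }
      lazyRdd.cache()
      lazyRdd.map(_ + 1).take(1)
      val lazyComputed: Long = lazyEvals.value

      // Pattern 2: cache().count(). The count() visits every partition,
      // so the whole RDD is cached before any downstream work starts.
      val eagerRdd = sc.parallelize(1 to 10, 5).map { x => eagerEvals.add(1); x }
      eagerRdd.cache().count()
      eagerRdd.map(_ + 1).take(1)
      val eagerComputed: Long = eagerEvals.value

      (lazyComputed, eagerComputed)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (lazyComputed, eagerComputed) = run()
    println(s"cache() only: $lazyComputed elements computed")
    println(s"cache().count(): $eagerComputed elements computed")
  }
}
```

With cache() alone, fewer than 10 elements should have been computed after take(1), so the cache is only partly filled. With cache().count(), all 10 are computed up front, at the cost of one extra pass over the data.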

Both are right, and which one to use depends on your requirements. I hope this makes things clearer.
