I need to understand whether there is any difference between the two caching approaches below when using Spark SQL, and whether one has a performance benefit over the other (given that building the DataFrames is costly and I want to reuse them across many actions).

1> Cache the original DataFrame before registering it as a temporary table

df.cache()

df.createOrReplaceTempView("dummy_table")

2> Register the DataFrame as a temporary table and cache the table

df.createOrReplaceTempView("dummy_table")

sqlContext.cacheTable("dummy_table")

Thanks in advance.

1 Answer

df.cache() is a lazy cache, which means the data is cached only when the next action is triggered.

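spark.sql("CACHE TABLE dummy_table") is an eager cache, which means the table is cached as soon as the command runs. Note that sqlContext.cacheTable("dummy_table") (spark.catalog.cacheTable in Spark 2.x and later) is not equivalent: like df.cache(), it only marks the table for caching, and the data is stored when the next action runs.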

As for whether one has a performance benefit over the other, it is hard to tell without understanding your entire workflow and how (and where) your cached DataFrames are used. I'd recommend using the eager cache, so you don't have to second-guess when (and whether) your DataFrame is cached.