
I heard Spark SQL is lazy: whenever a result table is referenced, Spark recalculates it :(

For example,

WITH tab0 AS (
   -- some complicated SQL that generates a table 
   -- with size of Giga bytes or Tera bytes
), 

tab1 AS (
   -- use tab0
),

tab2 AS (
   -- use tab0
),

...

tabn AS (
   -- use tab0
)

SELECT * FROM tab1
JOIN tab2 ON ...
...
JOIN tabn ON ...
...

In the query above, Spark could recalculate tab0 once for each CTE that references it, i.e. up to N times.

To avoid this, it should be possible to materialize tab0 once and reuse it. I found two solutions.

1) save tab0 to Parquet, then read it back and register it as a temp view

https://community.hortonworks.com/articles/21303/write-read-parquet-file-in-spark.html
How does createOrReplaceTempView work in Spark?
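
For option 1, something like this Scala sketch is what I have in mind (the SQL text and the Parquet path are placeholders for the real query and location):

// Run the expensive query once and materialize it to Parquet.
// The query text and the path below are placeholders.
val tab0 = spark.sql("/* complicated SQL that generates tab0 */ SELECT ...")
tab0.write.mode("overwrite").parquet("/tmp/tab0")

// Read it back and register it, so later SQL can refer to it as tab0.
spark.read.parquet("/tmp/tab0").createOrReplaceTempView("tab0")

// tab1..tabn now reuse the materialized data instead of re-running the query,
// but every reference re-reads the Parquet files from disk.
val result = spark.sql("WITH tab1 AS (SELECT ... FROM tab0), ... SELECT ...")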

2) make tab0 persistent

https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#rdd-persistence
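
For option 2, a sketch along these lines (again, the query text is a placeholder); note that for DataFrames, unlike RDDs, cache() defaults to MEMORY_AND_DISK:

import org.apache.spark.storage.StorageLevel

// Build tab0 once and mark it for caching.
val tab0 = spark.sql("/* complicated SQL that generates tab0 */ SELECT ...")
tab0.persist(StorageLevel.MEMORY_AND_DISK)   // or tab0.cache()

// Register it so the rest of the query can still be written in SQL.
tab0.createOrReplaceTempView("tab0")

// The first action that touches tab0 computes it and fills the cache;
// tab1..tabn then read the cached blocks instead of recomputing.
val result = spark.sql("WITH tab1 AS (SELECT ... FROM tab0), ... SELECT ...")
result.show()

// Free the cache once the result has been produced.
tab0.unpersist()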

Which one is better in terms of query speed?


1 Answer


If you have enough memory to hold the data, using dataFrame.cache() will be faster than writing it to disk as Parquet and accessing it through a temp view. With the temp-view-over-Parquet approach, the data will potentially be read from disk N times.

If you don't have enough memory, I'd benchmark and see if there's a difference between persisting with the MEMORY_AND_DISK storage level and writing to Parquet. It's hard for me to imagine that the disk format Spark uses for spilled cache blocks would be less efficient than Parquet (because why not just use Parquet in that case?), but I've learned to be careful with my assumptions when optimizing Spark code.
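
If you do benchmark, a rough sketch of the comparison could look like this (the queries, paths, and SparkSession are placeholders; you'd want warm-up runs and several repetitions before trusting the numbers):

import org.apache.spark.storage.StorageLevel

// Crude wall-clock timer for a single run.
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val out = block
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.1f s")
  out
}

// Variant A: persist tab0 with MEMORY_AND_DISK.
val tab0a = spark.sql("/* complicated SQL */ SELECT ...").persist(StorageLevel.MEMORY_AND_DISK)
tab0a.createOrReplaceTempView("tab0")
time("persist MEMORY_AND_DISK") {
  spark.sql("/* final query over tab1..tabn */ SELECT ...").write.mode("overwrite").parquet("/tmp/out_a")
}
tab0a.unpersist()

// Variant B: round-trip tab0 through Parquet.
time("Parquet round-trip") {
  spark.sql("/* complicated SQL */ SELECT ...").write.mode("overwrite").parquet("/tmp/tab0")
  spark.read.parquet("/tmp/tab0").createOrReplaceTempView("tab0")
  spark.sql("/* final query over tab1..tabn */ SELECT ...").write.mode("overwrite").parquet("/tmp/out_b")
}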