I heard that Spark SQL is lazy: whenever a result table is referenced, Spark recomputes it :(
For example,
WITH tab0 AS (
-- some complicated SQL that generates a table
-- with size of Giga bytes or Tera bytes
),
tab1 AS (
-- use tab0
),
tab2 AS (
-- use tab0
),
...
tabn AS (
-- use tab0
),
SELECT * FROM tab1
JOIN tab2 ON ...
...
JOIN tabn ON ...
...
Here, Spark may recompute tab0 up to n times, once for each CTE that references it.
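One way to check whether that actually happens is to look at the physical plan; a minimal PySpark sketch, assuming the full query text above is stored in a string named query (a placeholder name):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# If tab0 is recomputed, its scan/computation subtree shows up once
# per reference in the printed physical plan.
spark.sql(query).explain()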
To avoid this, tab0 can be materialized as a temporary table. I found two solutions:
1) Save tab0 into parquet, then load it back into a temp view (see the first sketch below)
https://community.hortonworks.com/articles/21303/write-read-parquet-file-in-spark.html
(see also: How does createOrReplaceTempView work in Spark?)
2) Make tab0 persistent with persist()/cache() (see the second sketch below)
https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#rdd-persistence
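A minimal PySpark sketch of option 1, assuming tab0_sql holds the SQL that defines tab0 and /tmp/tab0 is a scratch path (both names are placeholders, not from the linked article):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compute tab0 once and materialize it to disk as parquet.
spark.sql(tab0_sql).write.mode("overwrite").parquet("/tmp/tab0")

# Re-register the materialized data under the same name; later queries
# that reference tab0 now scan the parquet files instead of re-running
# the expensive computation.
spark.read.parquet("/tmp/tab0").createOrReplaceTempView("tab0")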
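And a sketch of option 2, again assuming tab0_sql is the defining query; note that persist() is itself lazy, so an action such as count() is needed to materialize the data up front:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

tab0 = spark.sql(tab0_sql)

# Keep the computed rows in memory, spilling to disk if they don't fit.
tab0.persist(StorageLevel.MEMORY_AND_DISK)
tab0.createOrReplaceTempView("tab0")

# persist() only marks the DataFrame for caching; the first action
# actually materializes it. count() forces that to happen eagerly.
tab0.count()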
Which one is better in terms of query speed?