
I am using the Spark SQL API. When I look at the SQL tab in the Spark UI, which details the query execution plan, it shows the Parquet scan stage multiple times even though I read the Parquet file only once. Is there a logical explanation for this?

I would also like to understand the different operators such as HashAggregate and SortMergeJoin, and to understand the Spark UI better as a whole.

Could you add code and screenshots to your question, please? - Vladislav Varslavans

1 Answer


If you are doing unions or joins, they may force your plan to be "duplicated" from the beginning.

Since Spark does not automatically keep intermediate results (unless you cache them), it has to read the sources multiple times.

Something like:

df = spark.read.parquet("ParquetFile1")
dfFiltered = df.filter("active = 1")
dfUnion = dfFiltered.union(df)

The plan will probably look like: ReadParquetFile1 --> Union <-- Filter <-- ReadParquetFile1
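For illustration, here is a minimal runnable PySpark sketch of that pattern (the file path and SparkSession setup are assumptions, not part of the original answer). Caching the source DataFrame lets both branches of the union reuse the scanned data instead of reading the Parquet file twice:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("union-scan-example").getOrCreate()

# Hypothetical input path -- replace with your own Parquet file
df = spark.read.parquet("/tmp/ParquetFile1")

# Without caching, the physical plan scans the Parquet source twice:
# once for the filtered branch and once for the raw branch of the union.
df.filter("active = 1").union(df).explain()

# Caching df tells Spark to keep the scanned data in memory, so both
# branches of the union read from an in-memory relation instead of Parquet.
df.cache()
df.filter("active = 1").union(df).explain()

Note that cache() is lazy: the data is materialized on the first action, after which the SQL tab in the Spark UI shows InMemoryTableScan nodes in place of the repeated Parquet scans.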