0 votes

I am new to Spark SQL and have a question about partition usage during joins.

Assume there is a table named test1 that is saved as 10 partitioned parquet files. Also assume that spark.sql.shuffle.partitions = 200.

Question: If test1 is joined to another table, will Spark perform the join using 10 partitions (the number of partitions the table resides in), or will it repartition the table into 200 partitions (per the shuffle partition value) and then perform the join, in which case the join would yield better performance? If the join is performed using the 10 partitions, isn't it better to always repartition (CLUSTER BY) the joining table into a higher number of partitions for better join performance? A rough sketch of the setup I mean is below.
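To make the setup concrete (paths and column names here are just placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-partitions").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "200")

// test1 is stored as 10 parquet part-files; test2 is just another table to join against
val test1 = spark.read.parquet("/data/test1")
val test2 = spark.read.parquet("/data/test2")

// Which partition count does this join run with: 10 or 200?
val joined = test1.join(test2, Seq("id"))
println(joined.rdd.getNumPartitions)
```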

In the Spark UI I have seen some stages use only 10 tasks, while other stages use 200 tasks.

Can someone please help me understand?

Thanks


2 Answers

0 votes

Spark will use 200 partitions in most cases (SortMergeJoin, ShuffleHashJoin), unless Spark estimates your table to be small enough for a BroadcastHashJoin.
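For illustration, a rough sketch reusing the test1/test2 frames from the question (the 10 MB threshold mentioned in the comments is the stock Spark default):

```scala
import org.apache.spark.sql.functions.broadcast

// Tables estimated below this threshold (default 10 MB) are broadcast,
// so that join needs no shuffle and no 200 shuffle partitions.
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

// Force a broadcast regardless of the size estimate (assuming test2 is small)
val result = test1.join(broadcast(test2), Seq("id"))

// Inspect the physical plan to see BroadcastHashJoin vs. SortMergeJoin
result.explain()
```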

0 votes

Spark will read the data in 10 partitions with 10 tasks, and it will similarly read the partitions of the other data frame used in the join. Once it has all the data, it will shuffle it into 200 partitions, which is the default value for shuffle partitions. That is why you see 10 tasks in one stage, some other number of tasks in a different stage, and finally 200 tasks after the shuffle operation. So after the join you will have 200 partitions by default, unless you have set that to a different value in the Spark configuration.
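As a rough sketch of where the two task counts come from (paths and column names are assumptions, and the read-side partition count is only roughly the file count, since it also depends on file sizes):

```scala
val test1 = spark.read.parquet("/data/test1")
println(test1.rdd.getNumPartitions)     // ~10: roughly one read task per parquet split

// Override the default of 200 shuffle partitions if you want a different post-join count
spark.conf.set("spark.sql.shuffle.partitions", "400")

val test2 = spark.read.parquet("/data/test2")
val joined = test1.join(test2, Seq("id"))
println(joined.rdd.getNumPartitions)    // 400 after the shuffle (unless the join was broadcast)
```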