0 votes

I am new to Spark SQL and have a question about partition usage during joins.

Assume there is a table named test1 that is saved as 10 partitioned parquet files. Also assume that spark.sql.shuffle.partitions = 200.

Question: If test1 is joined to another table, will Spark perform the join using 10 partitions (the number of partitions the table resides in), or will it repartition the table into 200 partitions (per the shuffle partition value) and then perform the join, in which case the join would yield better performance? If the join is performed using the 10 partitions, isn't it better to always repartition (CLUSTER BY) the joining table into a higher number of partitions for better join performance? A rough sketch of the setup I mean is below.
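To make the setup concrete (paths and column names here are just placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-partitions").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "200")

// test1 is stored as 10 parquet part-files; test2 is just another table to join against
val test1 = spark.read.parquet("/data/test1")
val test2 = spark.read.parquet("/data/test2")

// Which partition count does this join run with: 10 or 200?
val joined = test1.join(test2, Seq("id"))
println(joined.rdd.getNumPartitions)
```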

In the Spark UI I have seen some stages use only 10 tasks, while other stages use 200 tasks.

Can someone please help me understand?

Thanks


2 Answers

0 votes

Spark will use 200 partitions in most cases (SortMergeJoin, ShuffleHashJoin), unless Spark estimates your table to be small enough for a BroadcastHashJoin.
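For illustration, a rough sketch reusing the test1/test2 frames from the question (the 10 MB threshold mentioned in the comments is the stock Spark default):

```scala
import org.apache.spark.sql.functions.broadcast

// Tables estimated below this threshold (default 10 MB) are broadcast,
// so that join needs no shuffle and no 200 shuffle partitions.
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

// Force a broadcast regardless of the size estimate (assuming test2 is small)
val result = test1.join(broadcast(test2), Seq("id"))

// Inspect the physical plan to see BroadcastHashJoin vs. SortMergeJoin
result.explain()
```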

0 votes

Spark will read the data in 10 partitions with 10 tasks, and it will similarly read the partitions of the other data frame used in the join. Once it has all the data, it will shuffle it into 200 partitions, which is the default value for shuffle partitions. That is why you see 10 tasks in one stage, some other number of tasks in a different stage, and finally 200 tasks after the shuffle operation. So after the join you will have 200 partitions by default, unless you have set that to a different value in the Spark configuration.
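As a rough sketch of where the two task counts come from (paths and column names are assumptions, and the read-side partition count is only roughly the file count, since it also depends on file sizes):

```scala
val test1 = spark.read.parquet("/data/test1")
println(test1.rdd.getNumPartitions)     // ~10: roughly one read task per parquet split

// Override the default of 200 shuffle partitions if you want a different post-join count
spark.conf.set("spark.sql.shuffle.partitions", "400")

val test2 = spark.read.parquet("/data/test2")
val joined = test1.join(test2, Seq("id"))
println(joined.rdd.getNumPartitions)    // 400 after the shuffle (unless the join was broadcast)
```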