I am new to Spark SQL and have a question about partition usage during joins.
Assume there is a table named test1 that is stored as 10 partition (Parquet) files. Also assume that spark.sql.shuffle.partitions = 200.
Question:
If test1 is joined to another table, will Spark perform the join using 10 partitions (the number of partitions the table is stored in), or will it first repartition the table into 200 partitions (per the shuffle partition value) and then perform the join, in which case the join would likely perform better? If the join is performed using the 10 partitions, isn't it better to always repartition (CLUSTER BY) the joining table into a higher number of partitions for better join performance?
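To make the question concrete, here is a minimal sketch of the scenario, assuming hypothetical paths (/data/test1, /data/test2) and a hypothetical join key column id:

```scala
// Minimal sketch of the scenario in the question; paths, table names and the
// join column "id" are placeholders, not the real schema.
import org.apache.spark.sql.SparkSession

object JoinPartitionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JoinPartitionsDemo")
      .config("spark.sql.shuffle.partitions", "200") // the setting from the question
      .getOrCreate()

    // test1 is assumed to be stored as ~10 Parquet files / partitions
    val test1 = spark.read.parquet("/data/test1")
    val test2 = spark.read.parquet("/data/test2")

    println(s"test1 input partitions: ${test1.rdd.getNumPartitions}")

    // A plain equi-join on the assumed key column
    val joined = test1.join(test2, Seq("id"))

    // The physical plan shows whether an Exchange (shuffle) is inserted
    joined.explain()

    println(s"joined partitions: ${joined.rdd.getNumPartitions}")
  }
}
```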
In the Spark UI I have seen some stages using only 10 tasks, while other stages use 200 tasks.
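For reference, this is roughly the pattern I think I am seeing (assuming a spark-shell session where spark is predefined, and the same hypothetical path and column as above):

```scala
// The scan stage is sized by the input file splits (~10 tasks for 10 Parquet files),
// while any stage after a shuffle uses spark.sql.shuffle.partitions (200 tasks).
val scanned = spark.read.parquet("/data/test1")
println(scanned.rdd.getNumPartitions)    // roughly 10, driven by the file splits

val aggregated = scanned.groupBy("id").count()
println(aggregated.rdd.getNumPartitions) // 200, driven by spark.sql.shuffle.partitions
```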
Can someone please help me understand?
Thanks