
A Spark SQL aggregation operation shuffles data, and the number of shuffle partitions is controlled by spark.sql.shuffle.partitions (200 by default). What happens to performance when the shuffle partition count is set greater than 200?

I know that Spark uses a different data structure for shuffle book-keeping when the number of partitions is greater than 2000, so if the number of partitions is close to 2000 it makes sense to increase it to more than 2000.

But my question is: what will the behavior be when the shuffle partition count is greater than 200 (let's say 300)? A small sketch of what I'm doing is below.
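To make the setup concrete, here is a minimal sketch (local mode, a toy DataFrame, and adaptive query execution disabled so the partition count is not coalesced; the app name and data are just placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-partitions-demo")
  .master("local[*]")
  .config("spark.sql.adaptive.enabled", "false") // keep the post-shuffle partition count fixed
  .getOrCreate()

import spark.implicits._

println(spark.conf.get("spark.sql.shuffle.partitions")) // "200" by default

// Raise the setting before the aggregation runs, e.g. to 300.
spark.conf.set("spark.sql.shuffle.partitions", "300")

val df  = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")
val agg = df.groupBy("key").sum("value") // groupBy forces a shuffle

// The post-shuffle stage now has 300 partitions, i.e. 300 tasks.
println(agg.rdd.getNumPartitions)
```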


1 Answer


The default of 200 was chosen based on typical workloads on relatively big clusters with enough resources allocated to jobs. Otherwise, this number should be selected based on two factors: the number of available cores and the partition size (it's recommended to keep partitions close to 100 MB). The selected number of partitions should be a multiple of the number of available cores, but it shouldn't be very big (typically 1-3x the number of cores). Setting the number of partitions higher than the default doesn't change Spark's behavior - it just increases the number of tasks that Spark needs to execute. A sketch of this sizing heuristic follows.
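A rough sketch of the heuristic above; the ~60 GB shuffle-input size is an assumed value purely for illustration, while the ~100 MB target per partition and the multiple-of-cores rule come from the guidance in this answer:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-sizing").getOrCreate()

val cores             = spark.sparkContext.defaultParallelism // total cores available to the job
val shuffleInputMB    = 60L * 1024                            // assume ~60 GB of data being shuffled
val targetPartitionMB = 100L                                  // keep partitions close to ~100 MB

// Size-based estimate, rounded up to a multiple of the core count so that
// every wave of tasks keeps all cores busy.
val bySize     = math.ceil(shuffleInputMB.toDouble / targetPartitionMB).toInt
val partitions = ((bySize + cores - 1) / cores) * cores

spark.conf.set("spark.sql.shuffle.partitions", partitions.toString)
```

For small inputs the size-based estimate will be lower than the core count; in that case the 1-3x-cores rule of thumb is the better guide.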

You can watch this talk from Spark + AI Summit 2019 - it covers a lot of details on optimizing Spark programs, including selecting the number of partitions.