
I have a Spark batch job that consumes data from a Kafka topic with 300 partitions. The job applies various transformations, such as group by and join, which require shuffling.

I want to know whether I should go with the default value of spark.sql.shuffle.partitions, which is 200, or set it to 300, which matches the number of input partitions in Kafka and hence the number of parallel tasks spawned to read it.
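For context, the job is shaped roughly like the sketch below (the topic name, bootstrap servers, and session settings are placeholders, not my actual values):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-batch-job")
  // The setting in question: keep the default of 200 or match the 300 Kafka partitions?
  .config("spark.sql.shuffle.partitions", "300")
  .getOrCreate()

// Batch read from Kafka; Spark spawns one read task per Kafka partition, i.e. 300 parallel tasks.
val input = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "my-topic")
  .load()
```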

Thanks


1 Answer


In the chapter "Optimizing and Tuning Spark Applications" of the book "Learning Spark, 2nd Edition" (O'Reilly), it is written that the default value

"is too high for smaller or streaming workloads; you may want to reduce it to a lower value such as the number of cores on the executors or less.

There is no magic formula for the number of shuffle partitions to set for the shuffle stage; the number may vary depending on your use case, data set, number of cores, and the amount of executor memory available - it's a trial-and-error approach."

Your goal should be to reduce the number of small partitions being sent across the network to the executors' tasks.
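As a rough starting point (not a definitive rule), you can derive the setting from the total number of executor cores; the executor counts and defaults below are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("shuffle-tuning").getOrCreate()

// Illustrative only: read the executor sizing from the conf (the fallback values are made up).
val executorInstances = spark.conf.get("spark.executor.instances", "10").toInt
val coresPerExecutor  = spark.conf.get("spark.executor.cores", "4").toInt

// e.g. 10 executors * 4 cores = 40 shuffle partitions instead of the default 200.
spark.conf.set("spark.sql.shuffle.partitions", (executorInstances * coresPerExecutor).toString)
```

From there, adjust up or down based on how large the shuffled data actually is for your workload.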

There is also a recording of the talk "Tuning Apache Spark for Large Scale Workloads", which covers this configuration.

However, if you are using Spark 3.x, you do not have to think about this as much, because the Adaptive Query Execution (AQE) framework will dynamically coalesce shuffle partitions based on shuffle file statistics. More details on the AQE framework are given in this blog.
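As a sketch of the relevant settings (assuming spark is your active SparkSession), not a full tuning guide:

```scala
// AQE is enabled by default since Spark 3.2.0; on 3.0/3.1 you need to turn it on explicitly.
spark.conf.set("spark.sql.adaptive.enabled", "true")
// Lets AQE merge small post-shuffle partitions based on runtime shuffle statistics.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

// With coalescing on, spark.sql.shuffle.partitions (or
// spark.sql.adaptive.coalescePartitions.initialPartitionNum, if set) is only the initial
// partition count; AQE then coalesces partitions towards the advisory target size.
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")
```

In that setup it is usually fine to leave spark.sql.shuffle.partitions at a generous value and let AQE shrink the actual number of shuffle partitions at runtime.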