When doing a join in Spark, or generally for shuffle operations, I can set the number of partitions in which I want Spark to execute the operation.
As per the documentation:

> `spark.sql.shuffle.partitions` (default: 200) — Configures the number of partitions to use when shuffling data for joins or aggregations.
If I want to lower the amount of work that has to be done in each task, I have to estimate the total size of the data and adjust this parameter accordingly (more partitions means less work per task, but more tasks).
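For context, here is a minimal sketch of what I do today: I pick the partition count by hand and set it before the join. Paths, table names, and the chosen count (400) are just placeholders for illustration.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("join-shuffle-partitions")
    # I estimate the total shuffle data size by hand and pick a count so
    # that each task handles a manageable slice of the data.
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate()
)

# Placeholder inputs for illustration
df_left = spark.read.parquet("/data/left")
df_right = spark.read.parquet("/data/right")

# The join shuffles both sides into spark.sql.shuffle.partitions partitions
joined = df_left.join(df_right, on="id", how="inner")
joined.write.parquet("/data/joined")
```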
I am wondering: can I tell Spark to simply adjust the number of partitions based on the amount of data, i.e. set a maximum partition size for join operations?
An additional question: how does Spark know the total size of the datasets to be processed when it repartitions them into 200 roughly equal partitions?
Thanks in advance!