I am not sure whether I should increase or decrease the number of partitions when doing an aggregate operation. Let's say I am using PySpark DataFrames on PySpark 1.6.1.
I know that row-level transforms typically call for a greater number of partitions, and saving data to disk typically calls for fewer partitions.
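For example, those two cases look roughly like this for me (the paths and column names below are just placeholders, on PySpark 1.6.1):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="partitions-example")
sqlContext = SQLContext(sc)

# Placeholder input; assume a reasonably large Parquet dataset.
df = sqlContext.read.parquet("/data/events")

# Row-level transform: more, smaller partitions so each task is light.
transformed = df.repartition(800).withColumn("is_error", df["status"] >= 500)

# Saving to disk: fewer partitions so I don't write thousands of tiny files
# (coalesce reduces the partition count without a full shuffle).
transformed.coalesce(32).write.parquet("/data/events_flagged")
```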
But for an aggregation it is not clear to me what to do in PySpark.
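Concretely, the aggregation I have in mind is something like the following (continuing from the df above, column names still placeholders). If I understand correctly, in 1.6.x the number of post-shuffle partitions for a DataFrame groupBy comes from spark.sql.shuffle.partitions (default 200), and that is the value I don't know whether to raise or lower:

```python
from pyspark.sql import functions as F

# The shuffle for this groupBy produces spark.sql.shuffle.partitions
# partitions (default 200 in 1.6.x); I can change it before aggregating.
sqlContext.setConf("spark.sql.shuffle.partitions", "400")  # or "50"?

counts = df.groupBy("user_id").agg(F.count("*").alias("n_events"))
counts.write.parquet("/data/event_counts")
```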
Argument for increasing the number of partitions: since an aggregation has to shuffle data around, you want each task to shuffle and hold less data, so you increase the number of partitions in order to make each partition smaller.
Argument for decreasing the number of partitions: there is per-task overhead to schedule, gather, and compute each partition, so too many partitions results in too much overhead and the PySpark job can time out.
Which one is it?
Sources: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/