
I am not sure whether I should increase or decrease the number of partitions when doing an aggregate operation. Let's say I am using PySpark DataFrames (PySpark 1.6.1).

I know that row transforms typically benefit from a greater number of partitions, and that saving data to disk typically calls for fewer partitions.

But for an aggregation, it is not clear to me what to do in PySpark.

Argument for increasing the number of partitions: since an aggregation has to shuffle data around, you want each task to handle less data, so you increase the number of partitions to decrease the size of each partition.

Argument for decreasing the number of partitions: there is per-partition overhead to schedule, gather, and compute results, so too many partitions means too much overhead, and the PySpark job can time out.
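For reference, here is how I would change the partition count in either direction. This is just a sketch: the column names, data, and partition counts are placeholders, not from a real job.

    # Rough sketch in PySpark 1.6 (SQLContext era); all names/values below
    # are placeholders for illustration.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql import functions as F

    sc = SparkContext(appName="partition-question")
    sqlContext = SQLContext(sc)
    df = sqlContext.createDataFrame([(1, 10.0), (1, 5.0), (2, 3.0)],
                                    ["key", "amount"])

    # Increase partitions: a full shuffle into smaller pieces before aggregating.
    more = df.repartition(2000).groupBy("key").agg(F.sum("amount"))

    # Decrease partitions: coalesce shrinks the count without a full shuffle.
    fewer = df.coalesce(1).groupBy("key").agg(F.sum("amount"))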

Which one is it?

Sources: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/


1 Answer


Well, it depends.

Tuning the number of partitions to the problem makes some things easier and others a little harder. Here is what I have seen in my own experience.

Setting more partitions

I used this approach when an aggregation was followed by a data-enrichment step. With the default partitioning I was getting OOM errors and a few other problems: the aggregation plus the enrichment used more memory than my workers could supply. Increasing the number of partitions for that step solved the problem, although the job took longer to run because of the extra shuffle and task overhead.
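A minimal sketch of the fix, assuming a DataFrame aggregation: the number of post-shuffle partitions comes from spark.sql.shuffle.partitions (default 200). The names sqlContext, df, and dim_df stand in for the real job's objects, and 2000 is an illustrative value.

    from pyspark.sql import functions as F

    # Raise the post-shuffle partition count for DataFrame aggregations
    # (default is 200): more partitions -> smaller tasks -> less memory each.
    sqlContext.setConf("spark.sql.shuffle.partitions", "2000")

    aggregated = df.groupBy("user_id").agg(F.sum("amount").alias("total"))
    enriched = aggregated.join(dim_df, "user_id")  # the enrichment step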

Setting fewer partitions

This case was about shuffle time. I had Cassandra and Spark running on the same cluster, and with the DataStax connector I was reading data from Cassandra using the default of 200 shuffle partitions. But all the data lived on the same machine, which created a lot of shuffle during a single simple aggregation. So I reduced the number of partitions, and that reduced the shuffle time.
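Roughly what that looked like, sketched from memory; the keyspace/table names, the grouping column, and the target count of 10 are placeholders:

    # Read from Cassandra via the DataStax spark-cassandra-connector.
    df = (sqlContext.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace="my_keyspace", table="my_table")
            .load())

    # All the data sat on one node, so fewer partitions meant less shuffle;
    # coalesce reduces the partition count without triggering a full shuffle.
    result = df.coalesce(10).groupBy("some_key").count()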

Conclusion

You need to understand your data and what you want to do with it. There is no magic in data processing: work out what you need to do and how, and that will tell you whether to increase or decrease the number of partitions.