
I am new to Spark, so I am following this tutorial from sparkbyexamples.com, and while reading I found this section:

Shuffle partition size & Performance

Based on your dataset size, number of cores, and memory, PySpark shuffling can benefit or harm your jobs. When you are dealing with a small amount of data, you should typically reduce the number of shuffle partitions; otherwise you will end up with many partitions holding only a few records each, which results in running many tasks that each process very little data.

On the other hand, when you have a large amount of data, having too few partitions results in fewer, longer-running tasks, and sometimes you may also get an out-of-memory error.

Getting the right number of shuffle partitions is always tricky and takes many runs with different values to find the optimal setting. This is one of the key properties to look at when you have performance issues in PySpark jobs.
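For context, I understand the property in question is spark.sql.shuffle.partitions, which defaults to 200; a minimal sketch of checking and lowering it (the value 64 is just a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-partitions-demo").getOrCreate()

# Number of partitions used after a shuffle (joins, aggregations); defaults to 200
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Lower it for small datasets (64 is an arbitrary example value)
spark.conf.set("spark.sql.shuffle.partitions", "64")
```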

Can someone help me understand how to determine how many shuffle partitions you will need for your job?


1 Answer


As you quoted, it’s tricky, but this is my strategy:

If you’re using “static allocation”, meaning you tell Spark how many executors you want to allocate for the job, then it’s easy: the number of partitions could be executors * cores per executor * factor. factor = 1 means each core will handle 1 task, factor = 2 means each core will handle 2 tasks, and so on.
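As a rough sketch, assuming a static allocation of 5 executors with 4 cores each (these numbers are only placeholders for illustration), it could look like this:

```python
from pyspark.sql import SparkSession

# Assumed static allocation: 5 executors with 4 cores each (example numbers only)
num_executors = 5
cores_per_executor = 4
factor = 2  # roughly 2 tasks per core over the lifetime of a stage

shuffle_partitions = num_executors * cores_per_executor * factor  # 40

spark = (
    SparkSession.builder
    .appName("static-allocation-example")
    .config("spark.executor.instances", str(num_executors))
    .config("spark.executor.cores", str(cores_per_executor))
    .config("spark.sql.shuffle.partitions", str(shuffle_partitions))
    .getOrCreate()
)
```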

If you’re using “dynamic allocation”, then it’s trickier. You can read the long description here: https://databricks.com/blog/2021/03/17/advertising-fraud-detection-at-scale-at-t-mobile.html. The general idea is that you need to answer many questions, like: what is the size of your data (how many gigabytes), what its structure looks like (how many files, how many folders, how many rows, etc.), how you will read it (from HDFS, Hive, or JDBC), how many resources you have (cores, executors, memory), and so on. Then you run and benchmark over and over to find the sweet spot that is “just right” for your circumstances.
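As a rough starting point for dynamic allocation, the relevant Spark settings look roughly like this (the min/max executor bounds are placeholders you would tune while benchmarking):

```python
from pyspark.sql import SparkSession

# Example dynamic allocation setup; the bounds below are placeholders to tune per workload
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-example")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    # On YARN, dynamic allocation also needs the external shuffle service
    .config("spark.shuffle.service.enabled", "true")
    # On Spark 3.x, adaptive query execution can also coalesce shuffle partitions automatically
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)
```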

Update #1:

So what is the general industry practice: will a company simply use the first tactic and allocate more hardware, or will they use dynamic allocation?

Usually, if you have an on-premise Hadoop environment, you can choose between static allocation (the default mode) and dynamic allocation (the advanced mode). I often start with dynamic because I have no idea how big the data and its transformations are, so sticking with dynamic gives me the flexibility to expand my work without thinking too much about Spark configuration. But you can also start with static if you want to; nothing prevents you from doing so.

Eventually, when it comes to productionizing the process, you can also choose between static (very stable, but consumes more resources) and dynamic (less stable, i.e. it sometimes fails due to resource allocation, but saves resources).

Finally, most cloud Hadoop solutions (like Databricks) come with dynamic allocation by default, which is less costly.