Consider an example:

I have a cluster with 5 nodes; each node has 64 cores and 244 GB of memory.

I decide to run 3 executors on each node, with executor-cores set to 21 and executor memory of 80GB, so that each executor can execute 21 tasks in parallel. Now consider 315 (63 * 5) partitions of data, of which 314 partitions are 3GB in size but one is 30GB (due to data skew).

All of the executors that receive only 3GB partitions have 63GB occupied (21 * 3, since each executor can run 21 tasks in parallel and each task takes 3GB of memory).

But the one executor that receives the 30GB partition will need 90GB (20 * 3 + 30) of memory. So will this executor first execute the 20 tasks of 3GB and then load the 30GB task, or will it just try to load all 21 tasks at once and find that for one task it has to spill to disk? If I set executor-cores to just 15, then the executor that receives the 30GB partition will only need 14 * 3 + 30 = 72GB and hence won't spill to disk.
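
For reference, a minimal sketch of how such a configuration could be expressed when building the session (the app name is a placeholder, and the instance/core/memory numbers are just the ones from this example):

    import org.apache.spark.sql.SparkSession

    // Hypothetical setup mirroring the example: 5 nodes * 3 executors = 15
    // executors, each with 21 cores and 80GB of memory.
    val spark = SparkSession.builder()
      .appName("skew-example")                 // placeholder name
      .config("spark.executor.instances", "15")
      .config("spark.executor.cores", "21")    // or "15" for the reduced-parallelism variant
      .config("spark.executor.memory", "80g")
      .getOrCreate()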

So in this case, will reduced parallelism lead to no shuffle spill?

1 Answer

@Venkat Dabri,

Could you please format the question with appropriate carriage returns/spaces?

Here are a few pointers:

Spark (Shuffle) Map Stage ==> the size of each partition depends on the filesystem's block size. E.g., if data is read from HDFS, each partition will try to hold close to 128MB of data, so the number of input partitions = floor(number of files * block size / 128) (effectively 122.07 rather than 128, since mebibytes are used).
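
As a rough illustration, assuming a SparkSession named spark is already in scope (the HDFS path below is a placeholder), you can check how the block size drives the initial partition count:

    // ~40GB of input with a 128MB block size should yield roughly
    // 40 * 1024 / 128 = 320 input partitions.
    val rdd = spark.sparkContext.textFile("hdfs:///path/to/input") // placeholder path
    println(rdd.getNumPartitions)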

Now, the scenario you are describing is for shuffled data on the reducer side (the result stage).

Here the blocks processed by reducer tasks are called shuffle blocks, and by default Spark SQL will launch 200 reducer tasks (spark.sql.shuffle.partitions); for the Core/RDD API the number comes from spark.default.parallelism.
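
A sketch of the two knobs involved (the SQL value shown is the default; the Core value is just the partition count from this question):

    // SQL/DataFrame shuffles: number of reducer tasks (default 200).
    spark.conf.set("spark.sql.shuffle.partitions", "200")

    // RDD (Core) API: the equivalent knob must be set before the
    // SparkContext is created, e.g. in the session builder:
    //   .config("spark.default.parallelism", "315")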

Now, an important thing to remember: Spark can hold a max of 2GB in a single shuffle block, so if you have too few partitions and one of them triggers a remote fetch of a shuffle block larger than 2GB, you will see an error like Size exceeds Integer.MAX_VALUE.

To mitigate that, within the default limit Spark employs many optimizations (compression, Tungsten sort-based shuffle, etc.), but as developers we can repartition skewed data intelligently and tune the default parallelism.
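
For example, one common way to repartition skewed data is key salting: append a random suffix to the hot key so its rows spread across many reducer tasks, then merge the partial results. A minimal sketch, assuming a SparkSession named spark; the toy DataFrame, the column name "key", and the salt factor of 20 are illustrative, not from the question:

    import org.apache.spark.sql.functions._

    // Toy skewed input: most rows share the key 'hot'.
    val df = spark.range(1000)
      .selectExpr("if(id < 990, 'hot', concat('k', cast(id as string))) as key")

    // Spread each key over up to 20 salt buckets.
    val salted = df.withColumn("salted_key",
      concat(col("key"), lit("_"), (rand() * 20).cast("int")))

    // Aggregate per salted key first, then merge the partial counts per key.
    val partial = salted.groupBy("salted_key", "key").count()
    val result  = partial.groupBy("key").agg(sum("count").as("count"))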