After reading some online forums and Stack Overflow questions, what I understood is this:
Data spilling happens when an executor runs out of memory, and Shuffle Spill (Memory) is the size of the deserialized form of the data in memory at the time it is spilled.
I am running Spark locally, and I set the Spark driver memory to 10g.
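For reference, this is roughly how the session is set up (a minimal sketch; the app name is just a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of the local setup (app name is a placeholder).
// Note: spark.driver.memory only takes effect if it is set before the
// driver JVM starts, e.g. via `spark-submit --driver-memory 10g` or
// spark-defaults.conf; setting it on an already-running JVM has no effect.
val spark = SparkSession.builder()
  .appName("shuffle-spill-question")
  .master("local[*]")
  .config("spark.driver.memory", "10g")
  .getOrCreate()
```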
If my understanding is correct, then if a groupBy operation needs more than 10GB of execution memory, it has to spill data to disk.
Suppose a groupBy operation needs 12GB of memory. As the driver memory is set to 10GB, it has to spill roughly 2GB of data to disk, so Shuffle Spill (Disk) should be 2GB and Shuffle Spill (Memory) should be the remaining 10GB, because Shuffle Spill (Memory) is the size of the data in memory at the time of the spill.
If my understanding is correct, then Shuffle Spill (Memory) <= executor memory; in my case that is the driver memory, since I am running Spark locally.
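To make it concrete, the kind of groupBy I mean looks roughly like this (a hypothetical sketch, not my actual job; the key/value columns are made up):

```scala
import org.apache.spark.sql.functions.collect_list

// Hypothetical stand-in for my DataFrame; `key` and `value` are made-up columns.
val df = spark.range(0, 1000000)
  .selectExpr("id % 1000 as key", "id as value")

// groupBy is a wide transformation: rows are shuffled so that all rows with
// the same key end up in the same task, and each task builds an in-memory
// aggregation buffer that spills to disk once its execution memory runs out.
val grouped = df
  .groupBy("key")
  .agg(collect_list("value").alias("values"))
```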
But it seems like I am missing something. Below are the values from the Spark UI:
Total Time Across All Tasks: 41 min
Locality Level Summary: Process local: 45
Input Size / Records: 1428.1 MB / 42783987
Shuffle Write: 3.8 GB / 23391365
Shuffle Spill (Memory): 26.7 GB
Shuffle Spill (Disk): 2.1 GB
Even though I set the Spark driver memory to 10g, how can the shuffle spill (memory) be more than the memory allocated to the driver?
I observed the memory consumption in the Windows Task Manager; it never exceeded 10.5GB while the job was running, so how could Shuffle Spill (Memory) possibly be 26.7 GB?
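For reference, the driver memory setting can also be verified from inside the application (a minimal sanity check, not part of my job):

```scala
// What does Spark think spark.driver.memory is, and how big is the
// driver JVM heap that was actually allocated?
println(spark.sparkContext.getConf.get("spark.driver.memory", "not set"))
println(s"Max heap: ${Runtime.getRuntime.maxMemory / (1024 * 1024)} MB")
```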
DAG:
Event timeline: 45 tasks, because there are 45 partitions for the 4.25GB of data.
Here is the code that I am trying to run, which is the solution to my previous problem.