Why does Hadoop spilling happen?

Question

I am very new to the Hadoop system and still in the learning phase.

One thing I noticed in the shuffle and sort phase is that a spill happens whenever the MapOutputBuffer reaches 80% (I think this is also configurable).

Why is the spilling phase required?

Is it because MapOutputBuffer is a circular buffer, and if we don't empty it, data may be overwritten or memory may leak?

0x0FFF 0x0FFF · Accepted Answer · 2015-01-12T16:56:47

I've written a good article that covers this topic: http://0x0fff.com/hadoop-mapreduce-comprehensive-description/

In general:

Spilling happens when there is not enough memory to hold all of the mapper output. The amount of memory available for this buffer is set by mapreduce.task.io.sort.mb.
It starts when 80% of the buffer space is occupied because spilling is done in a separate thread, so that it does not interfere with the mapper. If the buffer reached 100% utilization, the mapper thread would have to stop and wait for the spilling thread to free up space. The 80% threshold is chosen to avoid this.
Spilling happens at least once, when the mapper finishes, because the mapper output has to be sorted and saved to disk for the reducer processes to read. There is no point in writing a separate function for this last "save to disk" step, because in general it does the same job.
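
To make the three points above concrete, here is a minimal sketch of the threshold logic. It is not Hadoop's actual MapOutputBuffer code: the class SpillSketch and the method simulate are hypothetical names, the buffer is counted in records rather than bytes, and everything runs in one thread, whereas the real spill runs in a background thread while the mapper keeps writing. In a real job the buffer size comes from mapreduce.task.io.sort.mb, and the threshold (which is configurable, as the question guessed) from mapreduce.map.sort.spill.percent, default 0.80.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, single-threaded illustration of map-side spilling.
// Not Hadoop code: capacity is measured in records, not bytes.
public class SpillSketch {

    // Returns the sorted runs ("spill files") produced for the given input.
    static List<List<Integer>> simulate(int capacity, double spillPercent, int[] records) {
        // Number of buffered records that triggers a spill (at least 1).
        int threshold = Math.max(1, (int) Math.floor(capacity * spillPercent));
        List<List<Integer>> spills = new ArrayList<>();
        List<Integer> buffer = new ArrayList<>();

        for (int record : records) {
            buffer.add(record);
            if (buffer.size() >= threshold) {
                // Threshold reached: sort the buffered records and write them out.
                spills.add(sortedCopy(buffer));
                buffer.clear();
            }
        }

        // End of mapper: the same spill routine writes whatever is left,
        // so there is always at least one spill.
        if (!buffer.isEmpty() || spills.isEmpty()) {
            spills.add(sortedCopy(buffer));
        }
        return spills;
    }

    private static List<Integer> sortedCopy(List<Integer> buffer) {
        List<Integer> run = new ArrayList<>(buffer);
        Collections.sort(run);
        return run;
    }

    public static void main(String[] args) {
        int[] records = new int[20];
        for (int i = 0; i < records.length; i++) {
            records[i] = records.length - i; // 20, 19, ..., 1
        }
        List<List<Integer>> spills = simulate(10, 0.80, records);
        System.out.println("Number of spills: " + spills.size());
        for (List<Integer> run : spills) {
            System.out.println("Spill of " + run.size() + " records: " + run);
        }
    }
}
```

The remaining 20% of the buffer is the headroom the sketch cannot show: in Hadoop the mapper keeps filling that space while the spill thread drains the rest, and it blocks only if the buffer fills completely before the spill finishes.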
