4
votes

You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper’s map method?

  • A. Intermediate data is streamed across the network from the Mapper to the Reducer and is never written to disk.
  • B. Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS.
  • C. Into in-memory buffers that spill over to the local file system of the TaskTracker node running the Mapper.
  • D. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer.
  • E. Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are written into HDFS.
3
To address the doubt about this question, the short answer is: the intermediate data is written to the local hard disk of the node where the map task runs. It is not kept only in memory, and it is not written to HDFS. Tuxman

3 Answers

3
votes

The TaskTracker is a daemon responsible for spawning map and reduce workers, and it usually resides on a DataNode. Map output records accumulate in an in-memory buffer until a certain threshold is reached; at that point they are written to disk in the background (see Memory Management in Hadoop's MapReduce tutorial). Writing to disk once the threshold capacity is reached is also called spilling to disk. The thresholds are controlled by configurable parameters, e.g. mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent on the map side.
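As a sketch of how those spill thresholds are tuned, the two parameters mentioned above can be set in mapred-site.xml (the values shown are the stock Hadoop defaults, used here only for illustration):

```xml
<!-- mapred-site.xml: illustrative spill-tuning values (Hadoop defaults) -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>100</value> <!-- size, in MB, of the in-memory sort buffer for map output -->
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value> <!-- buffer fill fraction that triggers a background spill to local disk -->
</property>
```

Raising the buffer size or the spill percentage reduces the number of spill files, at the cost of more task memory.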

Answer A is wrong because intermediate data is written to disk whenever the buffer spills.

Answers B and E can be excluded because spilled intermediate data isn't written to HDFS but to the local filesystem.

Finally, D is wrong because the question asks about the intermediate data of the Mapper's map method, which lives on the node running the Mapper, not the Reducer. Also, it's unnecessary to specify "outside HDFS", because in a Hadoop context the local filesystem is always understood to mean non-HDFS.

So, the correct answer is C.

1
votes

The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. This is typically a temporary directory location that the Hadoop administrator can set in the configuration. The intermediate data is cleaned up after the Hadoop job completes.

I think this is the parameter that has to be modified to change the intermediate data location:

mapreduce.cluster.local.dir
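For illustration, that parameter could be set in mapred-site.xml like this (the directory paths below are hypothetical; a comma-separated list spreads intermediate data across multiple disks):

```xml
<!-- mapred-site.xml: hypothetical local scratch directories for intermediate map output -->
<property>
  <name>mapreduce.cluster.local.dir</name>
  <value>/data1/mapred/local,/data2/mapred/local</value>
</property>
```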

1
votes

The mapper output is stored on the local filesystem (not HDFS) of the TaskTracker node. So your answer is option "C".