4
votes

You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper’s map method?

  • A. Intermediate data is streamed across the network from the Mapper to the Reducer and is never written to disk.
  • B. Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS.
  • C. Into in-memory buffers that spill over to the local file system of the TaskTracker node running the Mapper.
  • D. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer.
  • E. Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are written into HDFS.
3
To address the doubt about this question, the short answer is: the intermediate data is written to the local hard disk of the node where the map task runs. It is not kept only in memory, and it is not written to HDFS. Tuxman

3 Answers

3
votes

The TaskTracker is a daemon responsible for spawning map and reduce workers, and it usually resides on a DataNode. Map output records accumulate in an in-memory buffer until a certain threshold is reached; at that point they are written to disk in the background (see Memory Management in Hadoop's MapReduce tutorial). Writing to disk once the threshold capacity is reached is also called spilling to disk. The thresholds are controlled by configurable parameters, e.g. mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent on the map side.
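As a sketch of how those spill thresholds are tuned, the two parameters mentioned above can be set in mapred-site.xml (the values shown are the stock Hadoop defaults, used here only for illustration):

```xml
<!-- mapred-site.xml: illustrative spill-tuning values (Hadoop defaults) -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>100</value> <!-- size, in MB, of the in-memory sort buffer for map output -->
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value> <!-- buffer fill fraction that triggers a background spill to local disk -->
</property>
```

Raising the buffer size or the spill percentage reduces the number of spill files, at the cost of more task memory.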

Answer A is wrong because intermediate data is written to disk whenever the buffer spills.

Answers B and E can be excluded because spilled intermediate data isn't written to HDFS but to the local filesystem.

Finally, D is wrong because the question asks about the intermediate data of the Mapper's map method, which lives on the node running the Mapper, not the Reducer. Also, it's unnecessary to specify "outside HDFS", because in a Hadoop context the local filesystem is always understood to mean non-HDFS.

So, the correct answer is C.

1
votes

The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. This is typically a temporary directory location that the Hadoop administrator can set in the configuration. The intermediate data is cleaned up after the Hadoop job completes.

I think this is the parameter that has to be modified to change the intermediate data location:

mapreduce.cluster.local.dir
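For illustration, that parameter could be set in mapred-site.xml like this (the directory paths below are hypothetical; a comma-separated list spreads intermediate data across multiple disks):

```xml
<!-- mapred-site.xml: hypothetical local scratch directories for intermediate map output -->
<property>
  <name>mapreduce.cluster.local.dir</name>
  <value>/data1/mapred/local,/data2/mapred/local</value>
</property>
```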

1
votes

The mapper output is stored on the local filesystem (not HDFS) of the TaskTracker node. So your answer is option "C".