2
votes

While a MapReduce job runs, the map task results are stored on the local file system, and the final results from the reducer are then stored in HDFS. My questions are:

  1. Why are map task results stored on the local file system?
  2. In a MapReduce job with no reduce phase (only a map phase), where is the final result stored?

3 Answers

2
votes

1) Mapper output is stored on the local file system because, in most scenarios, we are interested only in the output of the Reducer phase (also known as the final output). The Mapper's <K,V> pairs are intermediate output, which is of little importance once it has been passed to the Reducer. Storing Mapper output in HDFS would waste storage: HDFS replicates data (replication factor 3 by default), so data that is not needed for any further processing would take up three times its size.

2) In a map-only job, the final output is stored in HDFS.
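
To make the storage argument in point 1 concrete, here is a back-of-the-envelope sketch in plain Java. It is not Hadoop code: the class name, the method name, and the 10 GiB figure are all made up for illustration, and it only assumes the default replication factor of 3 mentioned above.

```java
// Rough arithmetic only: how much raw disk the same map output would occupy
// as a single local copy versus replicated in HDFS.
class IntermediateStorageCost {
    // Raw bytes consumed when every byte is stored `replicas` times.
    static long bytesOnDisk(long outputBytes, int replicas) {
        return outputBytes * replicas;
    }

    public static void main(String[] args) {
        long mapOutput = 10L * 1024 * 1024 * 1024; // hypothetical 10 GiB of map output

        long local = bytesOnDisk(mapOutput, 1); // local file system: one copy
        long hdfs = bytesOnDisk(mapOutput, 3);  // HDFS with replication factor 3

        System.out.println("local FS: " + local + " bytes");
        System.out.println("HDFS:     " + hdfs + " bytes");
    }
}
```

The extra two copies would be spent on data that is discarded as soon as the job finishes.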

1
votes

1) After the TaskTracker (TT) has finished running the mapper logic, and before the output is sent to the Sort and Shuffle phase, the TT stores the output in temporary files on the local file system (LFS). This avoids restarting the entire MR job in case of a network glitch. Once stored in the LFS, the mapper output can be picked up directly from there. This data is called intermediate data, and the concept is called Data Localization.

This intermediate data is deleted once the job completes. Otherwise, the LFS would keep growing with intermediate data from different jobs over time.

Data Localization applies only to the Mapper phase, not to the Sort & Shuffle or Reducer phases.

2) When there is no reducer phase, the mapper output is written to HDFS as the final output.

0
votes

Why are map task results stored on the local file system?

Mapper output is temporary and is relevant only to the Reducer. Storing temporary output in HDFS (with its replication factor) is overkill. For this reason, the Hadoop framework stores Mapper output on the local file system instead of HDFS, which saves a lot of disk space.

One more important point from the Apache tutorial page:

All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output.

The Mapper outputs are sorted and then partitioned per Reducer
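
As a rough illustration of what "partitioned per Reducer" and "sorted" mean in the quote above, here is a small plain-Java simulation. It is a sketch, not Hadoop code: `ShuffleSketch` and its methods are hypothetical names, everything happens in memory, and the only part taken from Hadoop is the partition formula, which mirrors the default `HashPartitioner`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleSketch {
    // Same formula as Hadoop's default HashPartitioner:
    // clear the sign bit, then take the remainder by the number of reducers.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Groups map output by key into one partition per reducer.
    // TreeMap keeps the keys of each partition in sorted order.
    static List<TreeMap<String, List<Integer>>> partitionAndSort(
            List<Map.Entry<String, Integer>> mapOutput, int numReduceTasks) {
        List<TreeMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            partitions.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> kv : mapOutput) {
            int p = partitionFor(kv.getKey(), numReduceTasks);
            partitions.get(p)
                      .computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Hypothetical word-count style map output.
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("pear", 1), Map.entry("apple", 1),
                Map.entry("pear", 1), Map.entry("fig", 1));

        List<TreeMap<String, List<Integer>>> parts = partitionAndSort(mapOutput, 2);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("reducer " + i + " <- " + parts.get(i));
        }
    }
}
```

Every value for a given key lands in the same partition, which is what lets a single Reducer see all values for that key.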

In a MapReduce job with no reduce phase (only a map phase), where is the final result stored?

You can find more details on this in the Apache tutorial page:

Reducer NONE

It is legal to set the number of reduce-tasks to zero if no reduction is desired.

In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by FileOutputFormat.setOutputPath(Job, Path). The framework does not sort the map-outputs before writing them out to the FileSystem.

If the number of Reducers is greater than 0, mapper outputs are stored on the local file system and sorted before being sent to the Reducers. If the number of Reducers is 0, mapper outputs are stored in HDFS without sorting.