
In a MapReduce program, we just set the output path with FileOutputFormat.setOutputPath and write the results to an HDFS file using the mapper's or reducer's context.write(key, value).
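For reference, here is a minimal driver/reducer sketch of that setup (class and variable names are illustrative, loosely following the WordCount tutorial linked below):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                                  Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                // Each emitted pair ends up in the job's output file on HDFS.
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            // Mapper setup omitted for brevity; see the WordCount tutorial.
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // The output directory on HDFS; part-r-* files are created under it.
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }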

How does the file writing actually work?

  • The mapper/reducer will be continuously emitting records. Is each record sent to HDFS directly?

  • Or, once the application completes, does it do a copyFromLocal?

  • Or does it create a temporary file in the local file system for each mapper or reducer?

http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0

Map tasks flush data to local disk ("spilling records" is the term for that). Reduce tasks send data to HDFS. – Tuxman

When you say "Reduce tasks send data to HDFS", does that mean MapReduce appends the data to a file? – Vijay Innamuri

I don't know the details of the map output implementation, but I remember reading somewhere that the map operation writes its output in SequenceFile format, with io.file.buffer.size as the size of each file. The combiner is executed before this, and the sort operation reads these files. But I don't have a reference right now. – Tuxman

1 Answer


Records are written to a byte stream and flushed to HDFS periodically. Each record isn't written individually, as that would be a very expensive operation. Likewise, the data isn't staged on the local file system first, since that too would be very expensive.

Whenever I have questions about Hadoop internals, I tend to take advantage of its open-source nature and delve into the source code. In this case you'd want to look at the classes used when writing output - TextOutputFormat and FSDataOutputStream.
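To illustrate the idea, here is a simplified sketch (not the actual Hadoop source; the class name is mine) of how a record writer like TextOutputFormat's pushes each context.write into a single buffered stream:

    import java.io.DataOutputStream;
    import java.io.IOException;

    // Simplified sketch of how a TextOutputFormat-style record writer
    // appends key/value pairs to one output stream. In a real job the
    // stream is an FSDataOutputStream obtained from FileSystem.create(),
    // which buffers bytes client-side and ships them to HDFS in blocks.
    public class SketchRecordWriter<K, V> {
        private static final byte[] SEPARATOR = "\t".getBytes();
        private static final byte[] NEWLINE = "\n".getBytes();

        private final DataOutputStream out;

        public SketchRecordWriter(DataOutputStream out) {
            this.out = out;
        }

        // Called once per context.write(key, value): the record is appended
        // to the stream's buffer rather than sent to HDFS one record at a time.
        public void write(K key, V value) throws IOException {
            out.write(key.toString().getBytes());
            out.write(SEPARATOR);
            out.write(value.toString().getBytes());
            out.write(NEWLINE);
        }

        // Closing the writer flushes any remaining buffered bytes and
        // finalizes the file on HDFS.
        public void close() throws IOException {
            out.close();
        }
    }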