0
votes

I'm looking for some specific information regarding the chain of events when running a MapReduce job on a Hadoop cluster.

Let's assume that my Reduce tasks are on the verge of completion. After my last reducer has written its output to the output file, how many replicas of the output file are there? What exactly happens after the last reducer has finished writing to the output file? When does the NameNode ask the respective DataNodes to replicate the output file? And how is the NameNode informed that the output file is ready? Who conveys that information to the NameNode?

Thank you!


2 Answers

3
votes

The reduce tasks write their output to HDFS. Each reducer first asks the name node for a block. The name node replies with the data nodes to write to, and the reducer then sends the data directly to the first data node, which forwards it to the second data node, which in turn forwards it to the third. Replication therefore happens as part of the write itself, not as a separate step afterward. The name node typically keeps things local, so the first data node is probably the same machine that is running the reduce task.
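To make the pipeline concrete, here is a toy model in Python. It is only a sketch of the flow described above, not Hadoop's API: the `DataNode`, `NameNode`, and `write_block` names are made up for illustration, the node names are hypothetical, and the replication factor of 3 is assumed (it is the usual HDFS default).

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Toy stand-in for an HDFS data node: keeps blocks in a dict."""
    name: str
    blocks: dict = field(default_factory=dict)

    def receive(self, block_id, data, downstream):
        # Store the block locally, then forward it to the next node in the pipeline.
        self.blocks[block_id] = data
        if downstream:
            downstream[0].receive(block_id, data, downstream[1:])


class NameNode:
    """Toy stand-in for the name node: hands out block IDs and target nodes only."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication  # assumed default replication factor
        self.next_block_id = 0

    def allocate_block(self, writer_host):
        # Prefer the writer's own node as the first target, then fill with others.
        local = [d for d in self.datanodes if d.name == writer_host]
        others = [d for d in self.datanodes if d.name != writer_host]
        targets = (local + others)[: self.replication]
        block_id = self.next_block_id
        self.next_block_id += 1
        return block_id, targets


def write_block(namenode, writer_host, data):
    # 1. Ask the name node for a block and a list of target data nodes.
    block_id, targets = namenode.allocate_block(writer_host)
    # 2. Send the data to the first data node only; the pipeline does the rest.
    targets[0].receive(block_id, data, targets[1:])
    return block_id, [t.name for t in targets]


nodes = [DataNode(f"dn{i}") for i in range(1, 5)]
nn = NameNode(nodes)
block_id, pipeline = write_block(nn, "dn3", b"reducer output")
replica_count = sum(block_id in n.blocks for n in nodes)
print(pipeline, replica_count)
```

In this model the writer never talks to the second or third node, and all replicas exist by the time `write_block` returns, which is the point of the pipeline.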

Once the reducer has finished writing its output, and the data nodes have confirmed this, the reducer reports that it has finished. That report reaches the job tracker via the periodic heartbeat communication sent by the task tracker running the task.

1
votes

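To understand the basics of HDFS replication, read the replica placement section of the HDFS architecture document. In a nutshell, with the default replication factor of three the NameNode keeps two of the replicas on the same rack to limit inter-rack traffic, and places the remaining one on a different rack so that a rack failure does not lose the block.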
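As a rough illustration of rack-aware placement, here is a simplified sketch in Python. It assumes the policy as I understand it from the architecture document (first replica on the writer's node, second on a node in a remote rack, third on a different node in that same remote rack). The `choose_targets` function and the rack and node names are hypothetical, and the real policy picks nodes randomly and checks load and free space, which this sketch ignores.

```python
def choose_targets(writer, topology):
    """Pick three replica targets for a writer that is itself a data node.

    topology maps rack name -> list of node names (a hypothetical layout).
    """
    # Find the rack that holds the writer.
    local_rack = next(r for r, ns in topology.items() if writer in ns)
    # Replica 1: the writer's own node.
    first = writer
    # Replica 2: a node on some other rack (first one found, for determinism).
    remote_rack = next(r for r in topology if r != local_rack)
    second = topology[remote_rack][0]
    # Replica 3: a different node on that same remote rack.
    third = next(n for n in topology[remote_rack] if n != second)
    return [first, second, third]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
targets = choose_targets("dn1", topology)
racks_used = {r for r, ns in topology.items() for n in targets if n in ns}
print(targets, sorted(racks_used))
```

The block ends up on two racks, so only one copy has to cross the rack boundary during the write.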