
Below are some sentences from Hadoop: The Definitive Guide, in "Anatomy of a File Write" (HDFS). I am not clear on them; can someone provide more details?

If any datanode fails while data is being written to it, then the following actions are taken, which are transparent to the client writing the data. First, the pipeline is closed, and any packets in the ack queue are added to the front of the data queue so that datanodes that are downstream from the failed node will not miss any packets.

Q.) What does "datanodes that are downstream from the failed node will not miss any packets" mean? Can anyone explain this in more detail?

When the client has finished writing data, it calls close() on the stream. This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgements before contacting the namenode to signal that the file is complete.

Q.) What does "this action flushes all the remaining packets to the datanode pipeline" mean?

Q.) And if the client has finished writing its data, why do packets still remain, and why do they have to be flushed to the datanodes?
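For reference, this is the kind of client-side code the book is talking about (the file path and content are just examples I made up):

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // The path is just an example.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"))) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            // try-with-resources calls close() here: this is the point the book says
            // remaining packets are flushed to the datanode pipeline and acknowledgements
            // are awaited before the namenode is told the file is complete.
        }
        fs.close();
    }
}
```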


1 Answer


Ans.1) In Hadoop, the replication factor decides the data pipeline (check my answer on the data pipeline) through which the data is written. Say data is being written to 3 nodes and the 2nd node in the pipeline fails: the 3rd node, which sits downstream of the failure, may never have received the packets that were in flight at the failed node. By closing the pipeline and moving the un-acknowledged packets from the ack queue back to the front of the data queue, the client makes sure those packets are resent, so the two surviving nodes still get every packet and the data on them is written properly. They should not suffer because of another node's problem.
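Here is a very simplified sketch of that recovery step, just to illustrate the idea. It is not the real DFSOutputStream/DataStreamer code; the Packet class and the two queues are stand-ins for the client-side structures the book mentions:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified illustration only -- not Hadoop's actual client code.
public class PipelineRecoverySketch {

    static class Packet {
        final long seqno;
        Packet(long seqno) { this.seqno = seqno; }
        @Override public String toString() { return "packet#" + seqno; }
    }

    public static void main(String[] args) {
        Deque<Packet> dataQueue = new ArrayDeque<>();  // packets not yet sent
        Deque<Packet> ackQueue  = new ArrayDeque<>();  // sent, but not yet acked by all datanodes

        // Packets 1-3 were sent into the pipeline but not fully acknowledged;
        // packets 4-5 are still waiting to be sent.
        ackQueue.add(new Packet(1));
        ackQueue.add(new Packet(2));
        ackQueue.add(new Packet(3));
        dataQueue.add(new Packet(4));
        dataQueue.add(new Packet(5));

        // A datanode in the pipeline fails: close the pipeline and push every
        // un-acked packet back onto the FRONT of the data queue, preserving order.
        while (!ackQueue.isEmpty()) {
            dataQueue.addFirst(ackQueue.pollLast());
        }

        // The rebuilt pipeline (the surviving datanodes) now re-receives 1,2,3,4,5 in order.
        System.out.println("Resend order: " + dataQueue);
    }
}
```

Because the un-acked packets go back to the front of the queue, the datanodes downstream of the failed one receive everything, in the original order, and miss no packets.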

Ans.2 and 3) When writing to HDFS, the client finishes handing over its data at some point in time, but that doesn't mean all of the data has actually reached the datanodes downstream. The stream buffers data into packets on the client side, and some packets may still be queued or un-acknowledged; the datanodes could also still be waiting for CPU or memory to become available before they write. So the client calls close() on the output stream, which flushes the remaining packets into the pipeline, waits for their acknowledgements, and only then tells the namenode that the file is complete.
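To make that concrete, here is a rough sketch of what close() has to do, in the order the book describes. Again, this is illustrative only and not Hadoop's actual implementation; the buffer size and the method names sendRemainingPackets(), waitForAcks() and signalNamenodeComplete() are all invented for the example:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only -- not the real DFSOutputStream code.
public class CloseSketch {

    static class Packet { final byte[] bytes; Packet(byte[] b) { bytes = b; } }

    private final byte[] buffer = new byte[64 * 1024];   // client-side buffer
    private int buffered = 0;
    private final Deque<Packet> dataQueue = new ArrayDeque<>();

    void write(byte[] b) {                 // the client "finishing writing" = its last write() call
        System.arraycopy(b, 0, buffer, buffered, b.length);
        buffered += b.length;              // data may simply sit here, not yet at any datanode
    }

    void close() {
        // 1. Turn whatever is still buffered into a final packet and queue it.
        if (buffered > 0) {
            byte[] last = new byte[buffered];
            System.arraycopy(buffer, 0, last, 0, buffered);
            dataQueue.add(new Packet(last));
            buffered = 0;
        }
        sendRemainingPackets();   // 2. Push the remaining packets down the datanode pipeline.
        waitForAcks();            // 3. Block until every packet is acknowledged.
        signalNamenodeComplete(); // 4. Only now tell the namenode the file is complete.
    }

    private void sendRemainingPackets() { dataQueue.clear(); /* pretend they were sent */ }
    private void waitForAcks() { /* placeholder for waiting on the ack queue */ }
    private void signalNamenodeComplete() { System.out.println("namenode: file complete"); }

    public static void main(String[] args) {
        CloseSketch out = new CloseSketch();
        out.write("hello hdfs".getBytes());
        out.close();
    }
}
```

So "flushing the remaining packets" just means draining whatever the client-side stream is still holding, and close() exists precisely to guarantee steps 2-4 happen before the file is declared complete.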