6 votes

I have a basic question regarding file writes and reads in HDFS.

For example, if I am writing a file using the default configuration, Hadoop internally has to write each block to 3 data nodes. My understanding is that for each block, the client first writes the block to the first data node in the pipeline, which then informs the second, and so on. Once the third data node successfully receives the block, it sends an acknowledgement back to data node 2 and finally to the client through data node 1. Only after receiving the acknowledgement for the block is the write considered successful, and the client proceeds to write the next block.
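For reference, this is roughly what the client-side write looks like with the Hadoop Java FileSystem API (a minimal sketch; the path, contents and class name are just placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/example.txt");   // placeholder path
        try (FSDataOutputStream out = fs.create(file)) {
            // Data is buffered into packets and pushed down the DataNode pipeline
            // while the stream tracks acknowledgements internally.
            out.writeUTF("hello hdfs");
        }
        // close() (via try-with-resources) returns only after the outstanding
        // packets have been acknowledged, i.e. when the write is successful.
        fs.close();
    }
}
```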

If this is the case, then isn't the time taken to write each block more than in a traditional file write, due to:

  1. the replication factor (default is 3), and
  2. the write process happening sequentially, block after block?

Please correct me if I am wrong in my understanding. Also, please confirm the following:

  1. My understanding is that file read / write in Hadoop doesn't have any parallelism, and the best it can perform is the same as a traditional file read or write (i.e. if the replication is set to 1) plus some overhead involved in the distributed communication mechanism.
  2. Parallelism is provided only during the data processing phase via MapReduce, but not during file read / write by a client.

3 Answers

1 vote

Though your explanation of a file write above is correct, a DataNode can read and write data simultaneously. From the HDFS Architecture Guide:

a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline

A write operation takes more time than on a traditional file system (due to bandwidth issues and general overhead) but not as much as 3x (assuming a replication factor of 3).

1 vote

I think your understanding is correct.

One might expect a simple HDFS client to write some data and, once at least one block replica has been written, regain control while HDFS generates the other replicas asynchronously.

But in Hadoop, HDFS is designed around the "write once, read many times" pattern, so the focus wasn't on write performance.

On the other hand, you can find parallelism in Hadoop MapReduce (which can also be seen as an HDFS client), which is designed explicitly to do so.

1 vote

HDFS Write Operation:

There are two relevant parameters:

dfs.replication: Default block replication. The actual number of replicas can be specified when the file is created. The default is used if replication is not specified at create time.

dfs.namenode.replication.min: Minimal block replication.

Even though dfs.replication is set to 3, a write operation is considered successful once dfs.namenode.replication.min (default value: 1) replicas have been written.
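As a small illustration (the path, buffer size, block size and class name below are just example values), a client can set the default replication through the Configuration object or specify it per file at create time:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);            // client-side default replication
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/example.txt");     // placeholder path
        // Replication can also be specified per file at create time:
        // overwrite = true, 4 KB buffer, replication = 2, 128 MB block size.
        try (FSDataOutputStream out = fs.create(file, true, 4096, (short) 2, 128L * 1024 * 1024)) {
            out.writeUTF("hello hdfs");
        }
        // ...or changed afterwards; the NameNode re-replicates in the background.
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}
```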

But replication up to dfs.replication happens in a sequential pipeline: the first DataNode writes the block and forwards it to the second DataNode, which writes the block and forwards it to the third DataNode.

DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by DataNodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the DataNodes in the pipeline.

Have a look at the related SE question: Hadoop 2.0 data write operation acknowledgement

HDFS Read Operation:

HDFS read operations can happen in parallel (different blocks can be served by different DataNodes at the same time), unlike write operations, which go through the sequential pipeline described above.
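To see where that parallelism comes from, here is a minimal sketch (the path and class name are placeholders) that lists the blocks of a file and the DataNodes holding each replica; this is the same block-location information MapReduce uses to schedule map tasks that read different blocks from different nodes at the same time:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/tmp/example.txt")); // placeholder path

        // One entry per block, with the DataNodes that hold its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```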