5
votes

I'm trying to understand how data writes are managed in HDFS by reading the hadoop-2.4.1 documentation.

According to the following schema:

HDFS architecture

whenever a client writes something to HDFS, it has no contact with the namenode and is in charge of chunking and replication. I assume that in this case, the client is a machine running an HDFS shell (or equivalent).

However, I don't understand how this is managed. Indeed, according to the same documentation:

The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

Is the schema presented above correct? If so,

  • is the namenode only informed of new files when it receives a Blockreport (which can take time, I suppose)?

  • why does the client write to multiple nodes?

    If this schema is not correct, how does file creation work with HDFS?


1 Answer

2
votes

As you said, DataNodes are responsible for serving read/write requests and for block creation, deletion, and replication.

They also send, on a regular basis, "Heartbeats" (a state-of-health report) and a "BlockReport" (the list of blocks held by the DataNode) to the NameNode.

According to this article:

Data Nodes send heartbeats to the Name Node every 3 seconds via a TCP handshake, ... Every tenth heartbeat is a Block Report, where the Data Node tells the Name Node about all the blocks it has.

So block reports are sent every 30 seconds. I don't think this delay affects Hadoop jobs, because in general they are independent jobs.
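That cadence can be sketched in a few lines of plain Python (a toy simulation, not Hadoop code; the interval constants come straight from the quote above):

```python
HEARTBEAT_INTERVAL_S = 3   # DataNode -> NameNode heartbeat, per the quoted article
BLOCK_REPORT_EVERY = 10    # every tenth heartbeat carries a full block report

def schedule(num_heartbeats):
    """Return (time_in_seconds, kind) for each heartbeat a DataNode sends."""
    events = []
    for i in range(1, num_heartbeats + 1):
        kind = "block_report" if i % BLOCK_REPORT_EVERY == 0 else "heartbeat"
        events.append((i * HEARTBEAT_INTERVAL_S, kind))
    return events

# The first block report lands at 30 s, then every 30 s after that.
reports = [t for t, kind in schedule(20) if kind == "block_report"]
print(reports)  # → [30, 60]
```

So between two block reports the NameNode still hears from the DataNode ten times; it just doesn't get the full block list each time.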

For your question:

why does the client write to multiple nodes?

I'll say that actually, the client writes to just one DataNode and tells it to forward the data to the other DataNodes (see the picture at this link: CLIENT START WRITING DATA), but this is transparent to the client. That's why your schema shows the client as the one writing to multiple nodes.
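The forwarding described above can be sketched roughly like this (plain Python with hypothetical names; the real client/DataNode streaming protocol is far more involved than this):

```python
def write_block(block, pipeline):
    """Toy model of HDFS pipelined replication: the client hands the block to
    the first DataNode only; each DataNode stores it and forwards it to the
    next node in the pipeline."""
    received = {}

    def receive(node_index, data):
        node = pipeline[node_index]
        received[node] = data              # this DataNode stores the block...
        if node_index + 1 < len(pipeline):
            receive(node_index + 1, data)  # ...and forwards it downstream

    receive(0, block)  # the client only ever talks to pipeline[0]
    return received

replicas = write_block(b"block-0001", ["dn1", "dn2", "dn3"])
print(sorted(replicas))  # → ['dn1', 'dn2', 'dn3']
```

All three DataNodes end up with a replica, even though the client sent the data only once, to the first node in the pipeline.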