
I have a Hadoop cluster with 4 DataNodes. I am confused about two things: data replication and data distribution.

Suppose that I have a 2 GB file, my replication factor is 2, and the block size is 128 MB. When I put this file into HDFS, I see that 2 copies of each 128 MB block are created, but they are all placed on datanode3 and datanode4, while datanode1 and datanode2 are not used at all. The data is replicated because of the replication factor, but I expected to see some blocks on datanode1 and datanode2 as well. Is something wrong?
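For what it is worth, this is roughly how I checked where the blocks ended up (the file path here is just an example):

# show every block of the file and the DataNodes holding its replicas
hdfs fsck /user/hadoop/bigfile.dat -files -blocks -locations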

Let's say instead that I have 20 DataNodes and the replication factor is 2. If I put a 2 GB file on HDFS, I again expect two copies of each 128 MB block, but I also expect those blocks to be distributed across the 20 DataNodes.
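In case it is relevant, the put command I use looks roughly like this (file name and target path are only examples):

# -D overrides the cluster defaults for this single upload; 134217728 bytes = 128 MB
hdfs dfs -D dfs.replication=2 -D dfs.blocksize=134217728 -put bigfile.dat /user/hadoop/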


1 Answer


Ideally, the 2 GB file should get distributed among all the available DataNodes.

File size: 2 GB = 2048 MB
Block size: 128 MB
Replication factor: 2

With the above configuration you should end up with 2048 / 128 * 2 = 32 blocks (16 distinct blocks, each stored twice), and these blocks should be distributed almost equally across all DataNodes. Considering you have 4 DataNodes, each of them should hold around 8 blocks.
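If you want to confirm that the file was really written with a replication factor of 2 and 128 MB blocks, something like the following should work on a reasonably recent Hadoop version (the path is just a placeholder):

# %r prints the replication factor, %o the block size in bytes
hdfs dfs -stat "%r %o" /user/hadoop/bigfile.dat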

The only reason I can think of for not seeing this is that some of the DataNodes are down. Check whether all the DataNodes are up:

sudo -u hdfs hdfs dfsadmin -report
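If the report shows that some DataNodes were down when you wrote the file, the existing blocks will not spread out on their own once those nodes come back. In that case the HDFS balancer can redistribute them, roughly like this (the threshold value is just an example, and the balancer only moves blocks when disk usage differs enough between nodes):

# move blocks until every DataNode's usage is within 10% of the cluster average
sudo -u hdfs hdfs balancer -threshold 10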