4 votes

I have a Hadoop cluster with 8 machines, and all 8 machines are data nodes. A program running on one machine (say machine A) continuously creates sequence files (each about 1 GB) in HDFS.

Here's the problem: all 8 machines have the same hardware and the same capacity. While the other machines still have about 50% free disk space for HDFS, machine A has only 5% left. I checked the block info and found that almost every block has one replica on machine A.

Is there any way to balance the replicas? Thanks.


2 Answers

1 vote

This is the default placement policy: when the writer runs on a DataNode, the first replica of every block is placed on that local node, which is why machine A holds a replica of nearly every block. This policy works well for the typical M/R pattern, where each HDFS node is also a compute node and writes are distributed uniformly across the machines.

If you don't like it, there is HDFS-385, "Design a pluggable interface to place replicas of blocks in HDFS". You need to write a class that implements the BlockPlacementPolicy interface, and then set that class as the dfs.block.replicator.classname in hdfs-site.xml.
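As a rough sketch, the hdfs-site.xml entry would look like this (the class name com.example.MyBlockPlacementPolicy is a hypothetical placeholder for your own implementation):

 <property>
   <name>dfs.block.replicator.classname</name>
   <value>com.example.MyBlockPlacementPolicy</value>
 </property>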

1 vote

There is a way: you can use the Hadoop command-line balancer tool. HDFS data might not always be placed uniformly across the DataNodes; the balancer can be used to spread HDFS data uniformly across the DataNodes in the cluster.

 hadoop balancer [-threshold <threshold>]

where threshold is a percentage of disk capacity.
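For example, running the balancer with a threshold of 10 moves blocks between DataNodes until each node's disk utilization is within 10 percentage points of the cluster's average utilization:

 hadoop balancer -threshold 10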

See the following links for details: