I have written a script that checks the Hadoop block report and runs the Hadoop balancer if needed. I read this article about the HDFS balancer design, and it looks like the balancer should be run on a separate machine so that it does not overload the NameNode.
Please correct me if this understanding is wrong.
I have now set up a separate node and installed Hadoop on it, but this machine is not part of the cluster: no DataNode or TaskTracker daemons are running on it.
When I run the hadoop balancer command on this machine, I get only the following output:
$ hadoop balancer
Balancing took 135.0 milliseconds
$
I also tried executing the start-balancer.sh script directly, but the result is similar. The only difference is that the script writes that single line to its .out file.
When I run the hadoop balancer command on the NameNode, I get the following output:
ubuntu@master:~$ hadoop balancer
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved
14/11/19 06:14:03 INFO net.NetworkTopology: Adding a new node: /default-rack/20.232.273.15:50010
14/11/19 06:14:03 INFO net.NetworkTopology: Adding a new node: /default-rack/20.294.195.28:50010
14/11/19 06:14:03 INFO balancer.Balancer: 0 over utilized nodes:
14/11/19 06:14:03 INFO balancer.Balancer: 0 under utilized nodes:
The cluster is balanced. Exiting...
Balancing took 477.0 milliseconds
From this output, it looks like the balancer runs only on the NameNode. So my question is: does the balancer always have to be run on the NameNode? Or is some configuration needed to make it run on gateway machines?