3 votes

This is my first time playing around with a Hadoop cluster, so I'm very new at this.

I've got a small cluster of 3 nodes with 5 x 2 TB hard drives in each computer. All are running Ubuntu, have the same hardware specs, and are using Apache Hadoop 1.0.4. The hard disks are mounted as /media/diskb, /media/diskc, /media/diskd, etc. on each respective computer and are configured as JBOD.

All 3 computers are serving as Data Nodes and Task Trackers; one is also the master Name Node and Secondary Name Node, the 2nd is the Job Tracker, and the 3rd is a pure slave (DN/TT) node.

In each computer's hdfs-site.xml file, I have listed the mount points as comma-separated values with no spaces.

<property>
 <name>dfs.datanode.data.dir</name>
 <value>/data/dfs/data,/media/diskb/data/dfs/data,/media/diskc/data/dfs/data,..</value>
</property>

For the Name Node:

<property>
 <name>dfs.namenode.name.dir</name>
 <value>/data/dfs/name,/media/diskb/data/dfs/name,/media/diskc/data/dfs/name,..</value>
</property>

In mapred-site.xml:

<property>
 <name>mapred.local.dir</name>
 <value>/data/mapred/local,/media/diskb/data/mapred/local,/media/diskc/data/mapred/local,...</value>
</property>

Also, in core-site.xml

<property>
 <name>hadoop.tmp.dir</name>
 <value>/media/diskb/data</value>
</property>

(I've played around with pointing the temp directory at one disk at a time to check permissions, etc., and Hadoop works fine.)

The Hadoop user account has full permissions on the mounts and owns the directories. When I run a map/reduce program, I can see Hadoop create resource folders on the extra disks on each node under their mapred/local directories, but I don't see the same happening for the data node directories, and the configured capacity reported on the administration page (namenode:50070) is only 5.36 TB (1.78 TB per node).

Why isn't Hadoop using every hard disk? The combined capacity should be 26.7 TB.

Also, I don't see a performance increase when running a Map/Reduce job that utilizes all the disks versus just one disk on each node. What should I be expecting?

Thank you!


1 Answer

2 votes

OK, really simple answer: in Hadoop 1.x, dfs.namenode.name.dir should be dfs.name.dir, and dfs.datanode.data.dir should be dfs.data.dir.

I thought the old names (dfs.name.dir, dfs.data.dir) were deprecated, but apparently they are the ones Hadoop 1.0.4 actually reads. Since it didn't recognize the property names I used, Hadoop fell back to the defaults derived from hadoop.tmp.dir in core-site.xml, hence only 3 drives (one per node) being used.
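To spell that out, here is what the corrected hdfs-site.xml would look like with the 1.x property names, reusing the same paths from the question (the trailing entries are elided just as in the original):

```xml
<!-- Hadoop 1.x expects dfs.name.dir / dfs.data.dir,
     not the 2.x names dfs.namenode.name.dir / dfs.datanode.data.dir -->
<property>
 <name>dfs.data.dir</name>
 <value>/data/dfs/data,/media/diskb/data/dfs/data,/media/diskc/data/dfs/data,..</value>
</property>
<property>
 <name>dfs.name.dir</name>
 <value>/data/dfs/name,/media/diskb/data/dfs/name,/media/diskc/data/dfs/name,..</value>
</property>
```

After restarting HDFS, `hadoop dfsadmin -report` (or the namenode:50070 page) should show the full configured capacity across all the mounted disks.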