1 vote

I'm curious whether you can essentially separate the HDFS filesystem from the MapReduce framework. I know that the main point of Hadoop is to run the maps and reduces on the machines holding the data in question, but I was wondering if you could just edit the *.xml files to control which machines the JobTracker, NameNode and DataNodes run on.

Currently, my configuration is a two-VM setup: one (the master) running the NameNode, DataNode, JobTracker and TaskTracker (and the SecondaryNameNode), the other (the slave) running a DataNode and TaskTracker. Essentially, what I want to change is to have the master run the NameNode, DataNode(s) and JobTracker, and have the slave run only a TaskTracker to perform the computations (and later on, have more slaves with only TaskTrackers on them, one on each). The bottleneck will be the data transfer between the two VMs for the map and reduce computations, but since the data at this stage is so small I'm not primarily concerned with it. I would just like to know whether this configuration is possible, and how to do it. Any tips?
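
For reference, these are the two properties that pin down which machine the NameNode and JobTracker run on (a minimal sketch assuming Hadoop 1.x defaults; the hostname and ports are just placeholders):

    <!-- core-site.xml: where DataNodes and clients find the NameNode -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://master:9000</value>
    </property>

    <!-- mapred-site.xml: where TaskTrackers find the JobTracker -->
    <property>
      <name>mapred.job.tracker</name>
      <value>master:9001</value>
    </property>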

Thanks!


2 Answers

1 vote

You don't specify this kind of thing in the configuration files. What you have to do is take care of which daemons you start on each machine (you call them VMs, but I think you mean machines).
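
For example, here is a minimal sketch of starting only the daemons you want on each machine with bin/hadoop-daemon.sh (assuming Hadoop 1.x; run each command on the machine that should host that daemon):

    # on the master: storage plus job scheduling
    bin/hadoop-daemon.sh start namenode
    bin/hadoop-daemon.sh start datanode
    bin/hadoop-daemon.sh start jobtracker

    # on each slave: computation only
    bin/hadoop-daemon.sh start tasktracker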

I suppose you usually start everything using the start-all.sh script, which you can find in the bin directory under the Hadoop installation directory.

If you take a look at this script, you will see that all it does is call a couple of sub-scripts: one that starts the HDFS daemons (NameNode, DataNodes) and one that starts the MapReduce daemons (JobTracker, TaskTrackers).
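
Roughly, for Hadoop 1.x it boils down to this (paraphrased, not the verbatim script):

    # what bin/start-all.sh does, in essence
    bin/start-dfs.sh      # NameNode locally, DataNodes on the hosts listed in conf/slaves
    bin/start-mapred.sh   # JobTracker locally, TaskTrackers on the hosts listed in conf/slaves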

In order to achieve what you've described, I would do it like this (a rough sketch of the commands follows the steps):

  1. Modify the masters and slaves files like this: the masters file should contain the name of machine1; the slaves file should contain the name of machine2.

  2. Run start-mapred.sh

  3. Modify the masters and slaves files again: the masters file should contain machine1; the slaves file should also contain machine1.

  4. Run start-dfs.sh
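
A rough sketch of the steps above, assuming machine1 and machine2 are the hostnames of your master and slave VMs and that everything is run from machine1 (I haven't tried exactly this, so treat it as a sketch):

    # steps 1-2: point the slaves file at machine2, then start the MapReduce daemons
    echo machine1 > conf/masters
    echo machine2 > conf/slaves
    bin/start-mapred.sh     # JobTracker here, TaskTracker on machine2

    # steps 3-4: point the slaves file back at machine1, then start the HDFS daemons
    echo machine1 > conf/slaves
    bin/start-dfs.sh        # NameNode and DataNode on machine1 only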

I have to tell you that I've never tried such a configuration, so I'm not sure it is going to work, but you can give it a try. Anyway, the solution lies in this direction!

0 votes

Essentially, what I want to change is to have the master run the NameNode, DataNode(s) and JobTracker, and have the slave run only a TaskTracker to perform the computations (and later on, have more slaves with only TaskTrackers on them, one on each).

First, I am not sure why you would want to separate the computation from the storage. The whole purpose of MapReduce data locality is lost, though you might still be able to run the job successfully.

Use the dfs.hosts and dfs.hosts.exclude parameters to control which DataNodes can connect to the NameNode, and the mapreduce.jobtracker.hosts.filename and mapreduce.jobtracker.hosts.exclude.filename parameters to control which TaskTrackers can connect to the JobTracker. One disadvantage of this approach is that the DataNode and TaskTracker daemons are still started on the excluded nodes, even though those nodes aren't part of the Hadoop cluster.
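
A minimal sketch of those properties (the include-file paths and names are placeholders you would adapt):

    <!-- hdfs-site.xml: only hosts listed in the include file may register as DataNodes -->
    <property>
      <name>dfs.hosts</name>
      <value>/path/to/conf/dfs.include</value>
    </property>

    <!-- mapred-site.xml: only hosts listed in the include file may register as TaskTrackers -->
    <property>
      <name>mapreduce.jobtracker.hosts.filename</name>
      <value>/path/to/conf/mapred.include</value>
    </property>

Here dfs.include would list machine1 (the only DataNode) and mapred.include would list machine2 (the only TaskTracker), one hostname per line.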

Another approach is to modify the code so that the TaskTracker and the DataNode each have their own slaves file. Currently, this is not supported in Hadoop and would require a code change.