2 votes

Is this the way Hadoop works?

  1. The client submits a MapReduce job/program to the NameNode.

  2. The JobTracker (which resides on the NameNode) allocates tasks to the slave TaskTrackers running on the individual worker machines (DataNodes).

  3. Each TaskTracker is responsible for executing and managing the individual tasks assigned by the JobTracker (a minimal driver for this flow is sketched right after this list).
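
For reference, this is roughly how I picture that submission in code, using the newer `mapreduce` API. It is only a minimal sketch: the word-count mapper/reducer classes shipped with Hadoop are used as placeholders, and the input/output paths come from the command line.

```java
// Minimal driver sketch: submits a job to the cluster; the map/reduce tasks then run on the
// worker nodes, not in this client JVM. Mapper/reducer here are the stock Hadoop library classes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();         // picks up core-site.xml / mapred-site.xml
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenCounterMapper.class);      // emits (word, 1)
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);           // sums the counts per word
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Hands the job to the cluster (JobTracker in Hadoop 1, ResourceManager in YARN).
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```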

According to the above scenario, the MapReduce program runs on the slave nodes. Does that mean the job consumes the slaves' computation/processing power?

What if I want to use another machine (independent of the Hadoop installation) to execute the MapReduce job while still using the Hadoop cluster's data?
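
For context, I understand that any machine with the Hadoop client libraries and network access to the cluster can at least read the cluster's data directly over HDFS, roughly like the sketch below (the NameNode URI and the file path are just placeholders for my setup).

```java
// Reading a file from a remote HDFS cluster from an arbitrary client machine.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteHdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the cluster's NameNode (placeholder host/port).
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/data/input/part-00000"))))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // bytes are streamed from the DataNodes
            }
        }
        fs.close();
    }
}
```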

Why would I still use a Hadoop cluster at all? Because Hadoop distributes large data very efficiently across its DataNodes.

The new scenario would be as follows:

a. Server

b. Client

a.1) Distribute the unordered data using the Hadoop cluster.

b.1) The client executes a MapReduce job itself (not submitted to the NameNode) that reads data from the Hadoop cluster's DataNodes. If that is possible, what happens to the JobTracker (NameNode) and the TaskTrackers (DataNodes)?

I know I am ignoring a major part of Hadoop here by executing the job on the client machine, but that is my project requirement. Any suggestions?
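
One approach I am considering (just a sketch, and I am not sure it is a good one) is to keep fs.defaultFS pointed at the cluster but run the job with the local runner, so neither the JobTracker/TaskTrackers nor the ResourceManager/NodeManagers are involved; host names and paths below are placeholders.

```java
// Sketch: run the MapReduce job in this client JVM ("local" runner) while the input still
// comes from the remote HDFS cluster. No cluster-side processing daemons are used.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class LocalRunnerJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");   // data stays in the cluster
        conf.set("mapreduce.framework.name", "local");           // tasks run in this JVM, not on YARN
        Job job = Job.getInstance(conf, "client-side job");
        job.setJarByClass(LocalRunnerJob.class);
        job.setMapperClass(TokenCounterMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output-local"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The obvious downside is that all the input data gets pulled over the network into a single JVM, so I lose both parallelism and data locality.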


5 Answers

2 votes

You are right in the first part. First, note that the JobTracker/TaskTracker architecture belongs to Hadoop 1. You should look at Hadoop 2, which is the more recent architecture.

You are confusing HDFS with MapReduce.

  • HDFS: the distributed file system of Hadoop. The NameNode is the master of the cluster; it holds the metadata and the locations of the files. The DataNodes are the slaves of the cluster; they store the data across the cluster.

  • MapReduce: the "new" architecture is called YARN and works like this: there is a master role, the ResourceManager, and slaves, the NodeManagers. When you submit a MapReduce jar to the cluster, the ResourceManager allocates the processing to NodeManagers. To simplify, each NodeManager executes the program on a part of a file stored in HDFS.

So just keep the HDFS role and the MapReduce role clearly separated.
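
One way to see that separation in practice is in the client configuration: the HDFS side and the YARN side are addressed by completely independent settings. A minimal sketch, where the host names are placeholders:

```java
// The two roles show up as two independent pieces of client configuration.
import org.apache.hadoop.conf.Configuration;

public class ClusterRoles {
    public static Configuration clusterConf() {
        Configuration conf = new Configuration();
        // HDFS role: where the NameNode (file system metadata) lives.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
        // MapReduce/YARN role: which framework runs the job and where the ResourceManager lives.
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("yarn.resourcemanager.hostname", "resourcemanager-host");
        return conf;
    }
}
```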

2 votes

Hadoop is a framework to store, analyze, and process big data at terabyte/petabyte scale.

Storage:

HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data. Have a look at the HDFS architecture.

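As a rough sketch of what this looks like from a client (the NameNode URI and path are placeholders): the client only talks to the FileSystem abstraction, the NameNode records the metadata, and the DataNodes end up holding the blocks.

```java
// Writing a small file to HDFS: the NameNode tracks the metadata, the DataNodes store the blocks.
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);
        try (FSDataOutputStream out = fs.create(new Path("/tmp/hello.txt"), true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```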

Processing:

MapReduce is a framework to process distributed data. The fundamental idea of MRv2 is to split the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM).

The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network), and reporting it to the ResourceManager/Scheduler. Have a look at the YARN architecture.

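As a rough illustration of the ResourceManager/NodeManager relationship, the sketch below asks the ResourceManager which NodeManagers it currently tracks and what resources they report. It uses the YarnClient API and assumes the ResourceManager address is available from the local yarn-site.xml.

```java
// Listing the NodeManagers known to the ResourceManager, with the resources each one reports.
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListNodeManagers {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());   // reads yarn-site.xml for the ResourceManager address
        yarn.start();
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + "  containers=" + node.getNumContainers()
                    + "  used=" + node.getUsed()
                    + "  capacity=" + node.getCapability());
        }
        yarn.stop();
    }
}
```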

Summary:

NameNode + DataNode are the daemons for storage

and

ResourceManager + Scheduler + NodeManager + ApplicationMaster are the daemons for processing.

All these daemons can run on different nodes. If a DataNode and a NodeManager run on the same node, tasks can work on the data available on that DataNode, and performance improves thanks to data locality. If the DataNode and NodeManager processes run on different nodes and the NodeManager has to act on data stored on another DataNode, the data must be transferred over the network, which adds some overhead.
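
To see the data-locality information itself, you can ask the NameNode where a file's blocks actually live; this is the same information the scheduler uses to place tasks next to the data. A sketch, with the URI and path as placeholders:

```java
// Printing the hosts that store each block of a file. The scheduler prefers to run a task on
// (or near) one of these hosts so the NodeManager can read the block locally.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);
        FileStatus status = fs.getFileStatus(new Path("/data/input/part-00000"));
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                    + " -> hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```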

0 votes

Yes, it is possible to run TaskTrackers on machines separate from the ones running as DataNodes. Each TaskTracker needs to know where the DataNode that hosts its data block is. But as a best practice, the DataNode machines themselves are assigned as TaskTrackers.
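
In Hadoop 1 terms, the key point is that a TaskTracker (or a submitting client) only needs the JobTracker and NameNode addresses; it does not have to be co-located with a DataNode. A hedged sketch of those client-side settings, with host names and ports as placeholders:

```java
// Hadoop 1 style settings: the TaskTracker/client locates the JobTracker and the NameNode by
// address, so the machine itself does not need to run a DataNode.
import org.apache.hadoop.conf.Configuration;

public class Hadoop1ClientConfig {
    public static Configuration conf() {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode-host:8020");   // NameNode (Hadoop 1 key)
        conf.set("mapred.job.tracker", "jobtracker-host:8021");     // JobTracker address
        return conf;
    }
}
```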

0 votes

Yes, it is possible, but there is no single configuration switch for it. You would need to change the start scripts: instead of starting all daemons with start-all.sh, start the HDFS daemons (start-dfs.sh) on one set of nodes and the MapReduce daemons (start-mapred.sh in Hadoop 1, start-yarn.sh in Hadoop 2) on a separate set of nodes.

-2 votes

No, it does not work like that. I would read a couple of Hadoop books before embarking on a serious project. Start with Tom White's "Hadoop: The Definitive Guide".