How job client in hadoop compute inputSplits

Question

I am trying to get the insight of map reduce architecture. I am consulting this http://answers.oreilly.com/topic/2141-how-mapreduce-works-with-hadoop/ article. I have some questions regarding the component JobClient of mapreduce framework. My questions is:

How the JObClient Computes the input Splits on the data?

According to the stuff to which i am consulting , Job Client computes input splits on the data located in the input path on the HDFS specified while running the job. the article says then Job Client copies the resources(jars and compued input splits) to the HDFS. Now here is my question, when the input data is in HDFS, why jobClient copies the computed inputsplits into HDFS.

Lets assume that Job Client copies the input splits to the HDFS, Now when the JOb is submitted to the Job Tracker and Job tracker intailize the job why it retrieves input splits from HDFS?

Apologies if my question is not clear. I am a beginner. :)

Thomas Jungblut Thomas Jungblut · Accepted Answer · 2013-04-18T09:10:13

No the JobClient does not copy the input splits to the HDFS. You have quoted your answer for yourself:

Job Client computes input splits on the data located in the input path on the HDFS specified while running the job. the article says then Job Client copies the resources(jars and computed input splits) to the HDFS.

The input itself relies on the cluster. The client only computes on the meta information it got from the namenode (block size, data length, block locations). These computed input splits carry meta information to the tasks, e.G. of the block offset and the length to compute on.

Have a look at org.apache.hadoop.mapreduce.lib.input.FileSplit, it contains the file path the start offset and the length of the chunk a single task will operate on as its input. The serializable class you may also want to have a look at is: org.apache.hadoop.mapreduce.split.JobSplit.SplitMetaInfo.

This meta information will be computed for each task that will be run, and copied with the jars to the node that will actually execute this task.

How job client in hadoop compute inputSplits

2 Answers