2 votes

I develop MapReduce programs using Hadoop. My driver program submits a MapReduce job (with a map and a reduce task) to the Hadoop JobTracker. I have two questions:

a) Can my map or reduce task submit another MapReduce job (to the same Hadoop cluster and the same JobTracker)? That is, my driver program submits a MapReduce job whose map or reduce task spawns another MapReduce job and submits it to the same cluster and the same JobTracker. I think it's possible, but I'm not sure. Moreover, is it a good solution? If not, is there an alternative?

b) Can we use two map tasks (with two different map functions) and one reduce task in a single MapReduce job? Thanks a lot.

5
What is it you're trying to accomplish by launching MapReduce jobs from within a MapReduce job? – Pradeep Gollakota

I have two large input data sets (set1 and set2). To process each record of set1, I need all the elements of set2. So I intend to have my driver program submit set1 as the input data of a MapReduce job. Then, in the map task, in order to process one record of set1, I intend to submit another MapReduce job whose input data is set2. I don't know whether this is possible. I think it's possible in theory, but in practice it may fail because no slot is available. Would it be possible if my map function submitted the second MapReduce job to a different Hadoop cluster with a different JobTracker? – CD Tran

5 Answers

1 vote

You can certainly chain multiple map stages using the ChainMapper class.

You can also set up dependencies between jobs using the JobControl class and its addDependingJob() method. This is preferable to having MapReduce jobs spawn other MapReduce jobs, which goes against the fundamental approach of MapReduce and will likely make your solution no longer robust against hardware failure on an individual node.

Chapter 5 of Hadoop in Action by Chuck Lam has a good overview of this.
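A minimal driver sketch of the JobControl approach, assuming the classic org.apache.hadoop.mapred API; the job configurations are placeholders, and this needs a Hadoop installation on the classpath to actually run:

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class DependentJobsDriver {
    public static void main(String[] args) throws Exception {
        // Placeholder configurations; set input/output paths and
        // mapper/reducer classes on each JobConf as usual.
        JobConf firstConf = new JobConf(DependentJobsDriver.class);
        JobConf secondConf = new JobConf(DependentJobsDriver.class);

        Job first = new Job(firstConf);
        Job second = new Job(secondConf);
        // second starts only after first completes successfully.
        second.addDependingJob(first);

        JobControl control = new JobControl("dependent-jobs");
        control.addJob(first);
        control.addJob(second);

        // JobControl implements Runnable; run it in its own thread
        // and poll until all jobs have finished.
        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(5000);
        }
        control.stop();
    }
}
```

This keeps all scheduling logic in the driver, so a failed node only affects the currently running job rather than a chain of nested submissions.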

0
votes

No, I don't think it's possible. An alternative is to launch a single MapReduce job with both set1 and set2 as input. In the map phase, check which set each tuple was read from: if it came from set1, add it to one list; if it came from set2, add it to another. Then you can do whatever you want with those two lists!
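A rough, plain-Java illustration of the tagging idea above (not Hadoop code; the tab-separated source tag on each line is an assumption of this sketch, since in a real job you would typically check the input split's file path instead):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TagJoinSketch {
    // Routes each record into the bucket named by its source tag,
    // mimicking the if-condition in the map phase described above.
    public static Map<String, List<String>> tagBySource(List<String> lines) {
        Map<String, List<String>> buckets = new HashMap<>();
        buckets.put("set1", new ArrayList<>());
        buckets.put("set2", new ArrayList<>());
        for (String line : lines) {
            int sep = line.indexOf('\t');
            String tag = line.substring(0, sep);
            buckets.get(tag).add(line.substring(sep + 1));
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("set1\ta", "set2\tx", "set1\tb");
        Map<String, List<String>> b = tagBySource(input);
        System.out.println(b.get("set1")); // [a, b]
        System.out.println(b.get("set2")); // [x]
    }
}
```

Note that buffering all of set2 in memory only works when set2 is small enough to fit; otherwise a reduce-side join is the safer pattern.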

0
votes

You should look into Cascading, which is designed to chain (or "cascade") the output of one MapReduce job into another. It abstracts away much of the grunt work needed to make that happen and lets the developer work at a much higher level when writing complex, multi-step MapReduce jobs.

0
votes

I would suggest looking at the Oozie framework.

0
votes
  1. It is possible to launch an MR job from another MR job; the Oozie job launcher launches any action (Pig, Java, MR) using a map task as the launcher.

  2. Use the MultipleInputs API to define different mappers for different input paths while using the same reducer. This is the classic way of performing joins: https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/input/MultipleInputs.html
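A minimal driver sketch of the MultipleInputs approach, using the new org.apache.hadoop.mapreduce API; Set1Mapper, Set2Mapper, and JoinReducer are placeholder class names, and the sketch needs a Hadoop installation to run:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JoinDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "two-input-join");
        job.setJarByClass(JoinDriver.class);

        // Each input path gets its own mapper; both mappers emit
        // the same key/value types and feed a single reducer.
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, Set1Mapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, Set2Mapper.class);

        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

In the reducer, values for each join key arrive from both inputs, so the two mappers usually tag their output values so the reducer can tell which set each value came from.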