
I currently have a task where I need to chain a few jobs in Hadoop. What I am doing right now is that I have two jobs. My first job has a map function, a combiner, and a reducer. I need one more reduce phase, so I created a second job with a simple map task that passes the output of the previous reducer to the final reducer. I find this a bit "stupid" because there has to be a way to simply chain the phases. Moreover, I think the I/O would be reduced that way.

I am using version 0.20.203, and I can only find deprecated examples of ChainMapper and ChainReducer that use JobConf. I have found these: http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/lib/chain/ChainMapper.html and http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/lib/chain/ChainReducer.html, which seem to work with the Job class rather than JobConf (which is deprecated in 0.20.203), but there is no package that contains these classes in 0.20.203.
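For what it's worth, the old-API ChainMapper/ChainReducer classes under org.apache.hadoop.mapred.lib do ship in 0.20.x, even though JobConf is marked deprecated there. A minimal driver sketch, assuming hypothetical mapper/reducer classes AMap, BMap, CReduce, and DMap, might look like this (note the chain allows extra mappers before and after the reducer, but still only one reduce phase per job; a second reduce phase still needs a second job):

```java
// Sketch using the deprecated mapred API that is actually present in 0.20.203.
// AMap, BMap, CReduce, DMap are hypothetical placeholder classes.
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ChainDriver.class);
    conf.setJobName("chain");

    // First mapper in the chain: (LongWritable, Text) -> (Text, Text)
    ChainMapper.addMapper(conf, AMap.class,
        LongWritable.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

    // Second mapper, fed the previous mapper's output in memory, not via HDFS
    ChainMapper.addMapper(conf, BMap.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

    // The single reducer of the chain
    ChainReducer.setReducer(conf, CReduce.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

    // Optional mapper that runs after the reducer, still inside the same job
    ChainReducer.addMapper(conf, DMap.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

    JobClient.runJob(conf);
  }
}
```

This runs as one job of the form MAP+ / REDUCE MAP*, which avoids the extra intermediate write between jobs but does not give you a second reducer.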

Are you saying you find it a bit "stupid" having to write your own simple map task that passes the output? What is your question? Will you be getting the same key from the output of different reducers? - Pradeep Gollakota
I am saying that I simply want to chain two jobs without having to pass the output from the first job to the second. Doing it with two jobs is very simple; I know because I have already done it. But since Hadoop has an optimized way to chain jobs with reduced I/O, I simply want to use that. However, I keep finding only deprecated examples; I have three books about Hadoop and they all use deprecated examples. By now I have found another, more efficient way to do it than having two jobs, but I am keeping this post since I can't find any other post about chaining (for version 0.20.203). Thanks for your interest - jojoba
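For reference, the two-job approach described in the question can be sketched with the new (org.apache.hadoop.mapreduce) API as follows; the driver simply runs the second job over the first job's output directory. Class names (FirstMap, FirstReduce, PassThroughMap, SecondReduce) and paths are hypothetical placeholders:

```java
// Hypothetical driver chaining two jobs: job1's output path is job2's input path.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoPhaseDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path in = new Path(args[0]);
    Path tmp = new Path(args[1]);   // intermediate output of the first job
    Path out = new Path(args[2]);

    Job job1 = new Job(conf, "phase-1");   // Job.getInstance() is not available yet in 0.20.203
    job1.setJarByClass(TwoPhaseDriver.class);
    job1.setMapperClass(FirstMap.class);
    job1.setCombinerClass(FirstReduce.class);
    job1.setReducerClass(FirstReduce.class);
    job1.setOutputKeyClass(Text.class);
    job1.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job1, in);
    FileOutputFormat.setOutputPath(job1, tmp);
    if (!job1.waitForCompletion(true)) System.exit(1);

    Job job2 = new Job(conf, "phase-2");
    job2.setJarByClass(TwoPhaseDriver.class);
    job2.setMapperClass(PassThroughMap.class);  // passes records through to the second reducer
    job2.setReducerClass(SecondReduce.class);
    job2.setOutputKeyClass(Text.class);
    job2.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job2, tmp);
    FileOutputFormat.setOutputPath(job2, out);
    System.exit(job2.waitForCompletion(true) ? 0 : 1);
  }
}
```

The cost of this approach is exactly what the question points out: the intermediate results are materialized to HDFS between the two jobs.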

1 Answer


You can consider using Oozie. Creating a workflow would be much easier.
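A minimal sketch of such an Oozie workflow, chaining two map-reduce actions so the second reads the first's output; all names, paths, and property values here are hypothetical placeholders:

```xml
<workflow-app name="chain-jobs" xmlns="uri:oozie:workflow:0.2">
  <start to="first-mr"/>
  <action name="first-mr">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property><name>mapred.mapper.class</name><value>com.example.FirstMap</value></property>
        <property><name>mapred.reducer.class</name><value>com.example.FirstReduce</value></property>
        <property><name>mapred.input.dir</name><value>${inputDir}</value></property>
        <property><name>mapred.output.dir</name><value>${tmpDir}</value></property>
      </configuration>
    </map-reduce>
    <ok to="second-mr"/>
    <error to="fail"/>
  </action>
  <action name="second-mr">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property><name>mapred.reducer.class</name><value>com.example.SecondReduce</value></property>
        <property><name>mapred.input.dir</name><value>${tmpDir}</value></property>
        <property><name>mapred.output.dir</name><value>${outputDir}</value></property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>Map-reduce action failed</message></kill>
  <end name="end"/>
</workflow-app>
```

Note this orchestrates the two jobs rather than fusing them, so it does not remove the intermediate HDFS write the question is trying to avoid; its benefit is easier scheduling and error handling.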