7
votes

I have a four-phase MapReduce job, as follows:

Input -> Map1 -> Reduce1 -> Reduce2 -> Reduce3 -> Reduce4 -> Output

I notice that Hadoop has a ChainMapper class, which chains several mappers into one big mapper and saves the disk I/O cost between map phases. There is also a ChainReducer class, but it is not a real "chain reducer". It only supports jobs of the form:

[Map+ / Reduce Map*]

I know I can set up four MR jobs for my task and use the default (identity) mapper for the last three. But that costs a lot of disk I/O, since each reducer must write its result to disk so that the following mapper can read it. Is there any other built-in Hadoop feature for chaining reducers that lowers the I/O cost?

I am using Hadoop 1.0.4.

1
I, too, would be very interested in this, if it is possible. - Jim Pivarski

1 Answer

3
votes

I don't think the output of one reducer can be fed directly into another reducer. I would go with this instead:

Input-> Map1 -> Reduce1 -> 
        Identity mapper -> Reduce2 -> 
                Identity mapper -> Reduce3 -> 
                         Identity mapper -> Reduce4 -> Output
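
To illustrate why every reduce stage needs its own shuffle, here is a minimal plain-Java simulation of the data flow for two of the stages. It is only a sketch and does not use Hadoop at all: the class ReduceChainSketch, its helper methods, and the word-count-style reducers are hypothetical names made up for this example. The point it tries to show is that Reduce1 may emit different keys than it received, so its output has to be regrouped by key before Reduce2 can run, and the identity mapper merely passes records through unchanged to trigger that regrouping.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

// Plain-Java simulation only; none of these names are Hadoop APIs.
class ReduceChainSketch {

    // Convenience factory for one key/value record.
    static Map.Entry<String, Integer> entry(String key, int value) {
        return new AbstractMap.SimpleEntry<String, Integer>(key, value);
    }

    // Stand-in for the shuffle/sort phase: group values by key, keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> records) {
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (Map.Entry<String, Integer> r : records) {
            grouped.computeIfAbsent(r.getKey(), k -> new ArrayList<Integer>()).add(r.getValue());
        }
        return grouped;
    }

    // Stand-in for an identity mapper: every record is emitted unchanged.
    static List<Map.Entry<String, Integer>> identityMap(List<Map.Entry<String, Integer>> records) {
        return new ArrayList<Map.Entry<String, Integer>>(records);
    }

    // One reduce stage: shuffle the input, then call the reducer once per key.
    static List<Map.Entry<String, Integer>> reduceStage(
            List<Map.Entry<String, Integer>> input,
            BiFunction<String, List<Integer>, Map.Entry<String, Integer>> reducer) {
        List<Map.Entry<String, Integer>> out = new ArrayList<Map.Entry<String, Integer>>();
        for (Map.Entry<String, List<Integer>> group : shuffle(input).entrySet()) {
            out.add(reducer.apply(group.getKey(), group.getValue()));
        }
        return out;
    }

    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Map1 output -> Reduce1 -> identity mapper -> Reduce2.
    static List<Map.Entry<String, Integer>> run(List<Map.Entry<String, Integer>> map1Output) {
        // Reduce1 (hypothetical): sum per word, re-key the result by first letter.
        List<Map.Entry<String, Integer>> reduce1Output =
                reduceStage(map1Output, (word, counts) -> entry(word.substring(0, 1), sum(counts)));
        // Reduce2 (hypothetical): sum per first letter. Its input must be
        // regrouped because Reduce1 changed the keys.
        return reduceStage(identityMap(reduce1Output), (letter, totals) -> entry(letter, sum(totals)));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> input = new ArrayList<Map.Entry<String, Integer>>();
        input.add(entry("apple", 1));
        input.add(entry("avocado", 2));
        input.add(entry("apple", 3));
        input.add(entry("banana", 4));
        System.out.println(run(input)); // prints [a=6, b=4]
    }
}
```

In a real job, each reduceStage call above corresponds to a separate MapReduce job whose output is written to HDFS and read back by the next job's identity mapper, which is exactly the disk I/O the question is worried about.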

In the Hadoop 2.x series, you can chain mappers before the reducer with ChainMapper and chain mappers after the reducer with ChainReducer, all within a single job. Neither class lets one reducer feed another directly.