Now I have a 4-phase MapReduce job as follows:
Input-> Map1 -> Reduce1 -> Reducer2 -> Reduce3 -> Reduce4 -> Output
I notice that there is ChainMapper
class in Hadoop which can chain several mappers into one big mapper, and save the disk I/O cost between map phases. There is also a ChainReducer
class, however it is not a real "Chain-Reducer". It can only support jobs like:
[Map+/ Reduce Map*]
I know I can set four MR jobs for my task, and use default mappers for the last three jobs. But that will cost a lot of disk I/O, since reducers should write the result into disk to let the following mapper access it. Is there any other Hadoop built-in feature to chain my reducers to lower the I/O cost?
I am using Hadoop 1.0.4.