3 votes

Is it possible to share a value between a reducer and the mapper of the following job?

Or is it possible to store the output of the first reducer in memory so that the second mapper can access it from there?

The problem is, I have written a chained MapReduce flow like Map1 -> Reduce1 -> Map2 -> Reduce2.

Map1 and Map2 read the same input file.

Reduce1 derives a value, say 'X', as its output.

I need both 'X' and the input file for Map2.

How can we do this without reading the output file of Reduce1?

Is it possible to store 'X' in memory so that Map2 can access it?

2
Can you be more specific? Why do you need Map2 at all, rather than combining Reducer1 and Reducer2 into one reducer? – octo
My requirement is: I have input in a text file with rows like ID,PRICE,COUNT. First I need to find sum(COUNT) (done with the first reducer). Then I need to calculate A = COUNT*100/sum for each row, compute the cumulative A over the rows, and finally, given a value (e.g. 64), find the row corresponding to it (the first row whose cumulative A > 64). – vinu.m.19

2 Answers

4 votes

Each job is independent of the others, so without storing the output in an intermediate location it's not possible to share data across jobs.
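
For a single small value like 'X', a common pattern consistent with this is to let the driver read Reduce1's output itself and hand the value to the second job through its Configuration, so Map2 never opens the file. Here is a minimal sketch using the new (org.apache.hadoop.mapreduce) API; the output paths and the property name "shared.x" are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job 1: Map1 -> Reduce1. Writes its (tiny) result 'X' to HDFS.
        Job job1 = Job.getInstance(conf, "derive-x");
        // ... setJarByClass / setMapperClass / setReducerClass etc. ...
        FileInputFormat.addInputPath(job1, new Path(args[0]));
        FileOutputFormat.setOutputPath(job1, new Path("/tmp/job1-out"));
        if (!job1.waitForCompletion(true)) {
            System.exit(1);
        }

        // The driver reads Reduce1's output back from HDFS. This is the
        // "intermediate location": X still touches storage once.
        // Assumes a single reduce task and default TextOutputFormat.
        FileSystem fs = FileSystem.get(conf);
        String x;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                fs.open(new Path("/tmp/job1-out/part-r-00000"))))) {
            x = reader.readLine(); // assumes Reduce1 emits a single line
        }

        // Hand X to the second job through its Configuration; every Map2
        // task can then read it in setup() without opening the file itself.
        conf.set("shared.x", x);
        Job job2 = Job.getInstance(conf, "use-x");
        // ... setJarByClass / setMapperClass / setReducerClass etc. ...
        FileInputFormat.addInputPath(job2, new Path(args[0])); // same input file
        FileOutputFormat.setOutputPath(job2, new Path(args[1]));
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
```

Inside Map2, context.getConfiguration().get("shared.x") then returns the value in setup() or map().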

FYI, in the MapReduce model map tasks don't talk to each other, and the same is true of reduce tasks. Apache Giraph, which runs on Hadoop, uses communication between the mappers of the same job for iterative algorithms, which in plain MapReduce would require the same job to be run again and again.

I'm not sure what algorithm is being implemented or why MR was chosen, but every MR algorithm can also be implemented in BSP. Here is a paper comparing BSP with MR; some algorithms perform better in BSP than in MR. Apache Hama is an implementation of the BSP model, in the same way that Apache Hadoop is an implementation of MR.

1 vote

If the number of distinct rows produced by Reducer1 is small (say you have 10,000 (id, price) tuples), two-stage processing is preferable. You can load the results of the first map/reduce job into memory in each Map2 mapper and filter the input data there, as in the sketch below. That way no unneeded data is transferred over the network, and all data is processed locally. With combiners, the amount of data can be reduced even further.
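
A minimal sketch of that pattern with the new Hadoop mapper API; the property name "first.job.output" and the assumption that the first job wrote default tab-separated key/value text output are illustrative, not from the original code:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map2 extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, Long> firstJobOutput = new HashMap<String, Long>();

    @Override
    protected void setup(Context context) throws IOException {
        // Load the (small) output of the first job into memory, once per mapper.
        // The path is passed in via the configuration; "first.job.output" is a
        // made-up property name for this sketch.
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path path = new Path(context.getConfiguration().get("first.job.output"));
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(path)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // Default TextOutputFormat writes "key<TAB>value".
                String[] kv = line.split("\t");
                firstJobOutput.put(kv[0], Long.parseLong(kv[1]));
            }
        } finally {
            reader.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input rows look like ID,PRICE,COUNT (from the question).
        String id = value.toString().split(",")[0];
        // Filter locally against the in-memory results, so only the needed
        // rows are emitted and shuffled across the network.
        if (firstJobOutput.containsKey(id)) {
            context.write(new Text(id), value);
        }
    }
}
```

For very small lookup data like this, the distributed cache is another standard way to ship the first job's output to every Map2 task.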

If there is a huge number of distinct rows, it looks like you need to read the data twice.