I have some flight data (each line contains origin, destination, flight number, etc.), and I need to process it to output flight details between all origins and destinations with one stopover. My idea is to have two mappers: one outputs the destination as the key, the other outputs the origin as the key, so the reducer receives the stopover location as the key and all the origins and destinations as the list of values. The reducer can then output the one-stopover flight details for every location. A rough sketch of what I mean is below.
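To make the idea concrete, here is a sketch of the two mappers and the reducer, assuming the records are comma-separated with origin in field 0 and destination in field 1 (the class names and the IN:/OUT: tags are just placeholders I made up):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper 1: key each record by its destination, so flights arriving at a stopover group there
class DestKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        // fields[1] assumed to be the destination airport code
        context.write(new Text(fields[1]), new Text("IN:" + value.toString()));
    }
}

// Mapper 2: key each record by its origin, so flights departing the stopover group there too
class OriginKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        // fields[0] assumed to be the origin airport code
        context.write(new Text(fields[0]), new Text("OUT:" + value.toString()));
    }
}

// Reducer: key = stopover airport; pair every arriving leg with every departing leg
class StopoverReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text stopover, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> arriving = new ArrayList<>();
        List<String> departing = new ArrayList<>();
        for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("IN:")) {
                arriving.add(s.substring(3));
            } else {
                departing.add(s.substring(4));
            }
        }
        for (String in : arriving) {
            for (String out : departing) {
                context.write(stopover, new Text(in + " -> " + out));
            }
        }
    }
}
```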
So my question is: how do I run two different mappers over the same input file and have both of their outputs sent to a single reducer?
I read about MultipleInputs.addInputPath, but I suspect it needs the inputs to be different paths (or at least two copies of the same input).
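If I did stage the same data under two separate paths, this is roughly how I picture the driver (FlightDriver is a made-up name, the mapper/reducer classes are the ones from the sketch above, and the paths come from the command line):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlightDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "one-stopover flights");
        job.setJarByClass(FlightDriver.class);

        // args[0] and args[1] would point at two copies of the same flight data
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, DestKeyMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, OriginKeyMapper.class);

        job.setReducerClass(StopoverReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Duplicating the input just to get two mapper classes onto it feels wasteful, which is why I'm asking.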
I am also thinking of running the two mapper jobs independently via a workflow, and then a third job with an identity mapper and a reducer where I do the flight calculation.
Is there a better solution than this? (Please do not suggest Hive; I am not comfortable with it yet.) Any guidance on implementing this with plain MapReduce would really help. Thanks.