5 votes

For a given MR job, I need to produce two output files: one file should be the output of the Mapper, and the other should be the output of the Reducer (which is just an aggregation of the Mapper output above).

Can I have both the mapper and the reducer outputs written in a single job?

EDIT:

In Job 1 (Mapper phase only) the output contains 20 fields in a single row, which has to be written to HDFS (file1). In Job 2 (Mapper and Reducer) the Mapper takes its input from Job 1's output, deletes a few fields to bring it into a standard format (only 10 fields) and passes it to the Reducer, which writes file2.

I need both file1 and file2 in HDFS. My doubt is whether, in Job 1's mapper, I can write the data into HDFS as file1, then modify the same data and pass it on to the reducer.

PS: As of now I am using 2 jobs with a chaining mechanism, sketched below. The first job contains only a mapper; the second job contains a mapper and a reducer.
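For reference, a minimal driver sketch of that two-job chain might look like the following. The class names (TwentyFieldMapper, TrimFieldsMapper, DedupReducer) and the path arguments are placeholders, not part of the original question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job 1: map-only, writes the full 20-field records straight to HDFS as file1.
        Job job1 = Job.getInstance(conf, "job1-map-only");
        job1.setJarByClass(TwoJobDriver.class);
        job1.setMapperClass(TwentyFieldMapper.class);   // placeholder mapper
        job1.setNumReduceTasks(0);                      // map-only: the mapper output is the job output
        job1.setOutputKeyClass(NullWritable.class);
        job1.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job1, new Path(args[0]));
        FileOutputFormat.setOutputPath(job1, new Path(args[1]));   // file1
        if (!job1.waitForCompletion(true)) {
            System.exit(1);
        }

        // Job 2: mapper trims to the 10-field format, reducer de-duplicates and writes file2.
        Job job2 = Job.getInstance(conf, "job2-map-reduce");
        job2.setJarByClass(TwoJobDriver.class);
        job2.setMapperClass(TrimFieldsMapper.class);    // placeholder mapper
        job2.setReducerClass(DedupReducer.class);       // placeholder reducer
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job2, new Path(args[1]));     // reads file1
        FileOutputFormat.setOutputPath(job2, new Path(args[2]));   // file2
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}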

I guess you can use a single MR job for producing the result. Are you doing any transformations in the Mapper of the second job? If not, then pass the output of the Mapper to the Reducer in a single MR job. - YoungHobbit
In the second mapper I am just modifying the number of columns in a single row. For example: the output of mapper1 (file1) contains 20 columns, the output of mapper2 contains 7 columns. Duplicate rows from mapper2 will be removed in the reducer. - Abhinay
If you can do that in the first job's mapper then do it there and merge the jobs. Otherwise please provide detailed information about both jobs. - YoungHobbit
In Job 1 (Mapper only) the output contains 20 fields in a single row, which has to be written to HDFS (file1). In Job 2 (Mapper and Reducer) the Mapper takes input from Job 1's output, deletes a few fields to bring it into a standard format (only 10 fields) and passes it to the reducer, which writes file2. I need both file1 and file2 in HDFS. My doubt is whether, in Job 1's mapper, I can write data into HDFS as file1, then modify it and send it to the reducer. - Abhinay
Always add the information to the OP, because it is easier to read and directly available to future readers. - YoungHobbit

1 Answer

2 votes

You could perhaps use the MultipleOutputs class to define one output for the mapper and (optionally) one for the reducer. In the mapper you would have to write things twice: once to the extra output file (using MultipleOutputs) and once when emitting pairs to the reducer (as usual).
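As a rough illustration of that idea, a mapper along these lines could copy the full 20-field record to a named side output ("file1") while emitting the trimmed record to the reducer. The class name, the comma delimiter and the "keep the first 10 fields" trimming are assumptions for the sketch, not taken from the question:

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplitOutputMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private MultipleOutputs<NullWritable, Text> sideOutputs;

    @Override
    protected void setup(Context context) {
        sideOutputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 1) Write the untouched 20-field record to the named output "file1".
        sideOutputs.write("file1", NullWritable.get(), value);

        // 2) Trim the record to the standard 10-field format and emit it to the reducer,
        //    using the trimmed row as the key so the reducer can drop duplicates.
        String[] fields = value.toString().split(",");
        String trimmed = String.join(",", Arrays.copyOf(fields, Math.min(10, fields.length)));
        context.write(new Text(trimmed), NullWritable.get());
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        sideOutputs.close();   // flush the side output files
    }
}

The named output also has to be registered on the job in the driver, e.g. MultipleOutputs.addNamedOutput(job, "file1", TextOutputFormat.class, NullWritable.class, Text.class); the "file1" part-files then appear alongside the regular job output in HDFS.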

Then, you could also take advantage of the ChainMapper class to define the following workflow in a single job:

Mapper 1 (file 1) -> Mapper 2 -> Reducer (file 2)
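A driver along these lines could wire that chain together in one job. FullRecordMapper (writes file1 via MultipleOutputs and passes the record on unchanged), TrimFieldsMapper and DedupReducer are placeholder class names, not an existing implementation:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "mapper-and-reducer-output");
        job.setJarByClass(ChainDriver.class);

        // Mapper 1: writes the full 20-field record to the "file1" side output
        // (via MultipleOutputs) and passes the record on unchanged.
        ChainMapper.addMapper(job, FullRecordMapper.class,
                LongWritable.class, Text.class, Text.class, NullWritable.class,
                new Configuration(false));

        // Mapper 2: trims each record down to the standard 10-field format.
        ChainMapper.addMapper(job, TrimFieldsMapper.class,
                Text.class, NullWritable.class, Text.class, NullWritable.class,
                new Configuration(false));

        // Reducer: one output row per distinct key, i.e. duplicates are dropped; this is file2.
        job.setReducerClass(DedupReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Register the side output used by Mapper 1 for the untouched records.
        MultipleOutputs.addNamedOutput(job, "file1", TextOutputFormat.class,
                NullWritable.class, Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}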

To be honest, I've never used this logic myself, but you can give it a try. Good luck!