0
votes

I have a large document corpus as input to a MapReduce job (old Hadoop API). In the mapper, I can produce two kinds of output: one counting words and one producing MinHash signatures. What I need to do is:

  1. give the word-counting output to one reducer class (a typical WordCount reducer) and
  2. give the MinHash signatures to another reducer class (performing some calculations on the size of the buckets).

The input is the same corpus of documents, and there is no need to process it twice. I think that MultipleOutputs is not the solution, as I cannot find a way to give my Mapper output to two different Reducer classes.

In a nutshell, what I need is the following:

               WordCounting Reducer   --> WordCount output
              /
Input --> Mapper
              \
               MinHash Buckets Reducer --> MinHash output

Is there any way to use the same Mapper (in the same job), or should I split it into two jobs?


2 Answers

4
votes

You can do it, but it will involve some coding tricks (a Partitioner and a prefix convention). The idea is for the mapper to output each word prefixed with "W:" and each MinHash signature prefixed with "M:". Then use a Partitioner to decide which partition (i.e. which reducer) the key needs to go to.

Pseudocode for the MAIN method:

Set number of reducers to 2
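
Roughly, the driver could look like this (a sketch using the new mapreduce API; the class names PrefixMapper, PrefixCombiner, PrefixPartitioner and PrefixReducer are placeholders for the classes sketched below):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountMinHashDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "wordcount + minhash");
            job.setJarByClass(WordCountMinHashDriver.class);
            job.setMapperClass(PrefixMapper.class);           // sketched under MAPPER
            job.setCombinerClass(PrefixCombiner.class);       // sketched under Combiner
            job.setPartitionerClass(PrefixPartitioner.class); // sketched under Partitioner
            job.setReducerClass(PrefixReducer.class);         // sketched under Reducer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setNumReduceTasks(2); // partition 0 = word counts, partition 1 = minhash
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }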

MAPPER:

.... parse the word ...
... generate minhash ..
context.write("W:" + word, 1);
context.write("M:" + minhash, 1);

Partitioner:

IF Key starts with "W:" { return 0; } // reducer 1
IF Key starts with "M:" { return 1; } // reducer 2

Combiner:

IF Key starts with "W:" { iterate over values and sum; context.write(Key, SUM); return;} 
Iterate and context.write all of the values
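
A sketch of that combiner:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PrefixCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            if (key.toString().startsWith("W:")) {
                // Word counts can safely be pre-aggregated on the map side.
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            } else {
                // MinHash records are passed through unchanged.
                for (IntWritable v : values) {
                    context.write(key, v);
                }
            }
        }
    }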

Reducer:

IF Key starts with "W:" { iterate over values and sum; context.write(Key, SUM); return;} 
IF Key starts with "M:" { perform min hash logic }

In the output, part-00000 (part-r-00000 with the new API) will be your word counts and part-00001 your MinHash calculations.

Unfortunately it is not possible to provide two different Reducer classes in one job, but with the IF and the prefix you can simulate it.

Also, having just 2 reducers might not be efficient from a performance point of view; in that case you could play with the Partitioner to allocate the first N partitions to the word count, as in the sketch below.
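
For example, a variant of the partitioner along these lines (the split between word-count and MinHash partitions is only an assumption; tune it to your workload):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class WeightedPrefixPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (key.toString().startsWith("W:")) {
                // Spread word-count keys over the first numPartitions - 1 reducers.
                return (key.toString().hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
            }
            // All MinHash keys go to the last reducer.
            return numPartitions - 1;
        }
    }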

If you do not like the prefix idea, you would need to implement a secondary sort with a custom WritableComparable class for the key, but that is worth the effort only in more sophisticated cases.

0
votes

AFAIK this is not possible in a single MapReduce job: only the default output (the part-r-00000 files) goes through the reducer. So if you create two named outputs in the mapper, say WordCount-m-0 and MinHash-m-0,

you can create two other MapReduce jobs with an identity mapper and the respective reducers, specifying hdfspath/WordCount-* and hdfspath/MinHash-* as the input to the respective jobs.
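
A rough sketch of such a map-only first job using MultipleOutputs (the named outputs "WordCount" and "MinHash" match the names above and would have to be registered in the driver with MultipleOutputs.addNamedOutput(...), with job.setNumReduceTasks(0); computeMinHash() is a placeholder):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class SplittingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private MultipleOutputs<Text, IntWritable> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void map(LongWritable offset, Text doc, Context context)
                throws IOException, InterruptedException {
            // Word-count records go to the "WordCount" named output.
            for (String word : doc.toString().split("\\s+")) {
                mos.write("WordCount", new Text(word), new IntWritable(1));
            }
            // MinHash records go to the "MinHash" named output.
            mos.write("MinHash", new Text(computeMinHash(doc.toString())), new IntWritable(1));
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }

        private String computeMinHash(String document) {
            return Integer.toHexString(document.hashCode()); // placeholder
        }
    }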