I have a large document corpus as an input to a MapReduce job (old hadoop API). In the mapper, I can produce two kinds of output: one counting words and one producing minHash signatures. What I need to do is:
- give the word counting output to one reducer class (a typical WordCount reducer) and
- give the minHash signatures to another reducer class (performing some calculations on the size of the buckets).
The input is the same corpus of documents and there is no need to process it twice. I think that MultipleOutputs is not the solution, as I cannot find a way to give my Mapper output to two different Reduce classes.
In a nutshell, what I need is the following:
WordCounting Reducer --> WordCount output /
Input --> Mapper
\ MinHash Buckets Reducer --> MinHash output
Is there any way to use the same Mapper (in the same job), or should I split that in two jobs?