1
votes

I have a MapReduce job working with small amounts of data (200 MB). The map phase is computationally simple, but the reduce phase can be computationally expensive, taking much more time to analyze one input. Given a 32 MB split size, I see that in the map phase all machines are computing, but in the reduce phase only one is, and the reduce phase runs much more slowly. Is there a way to make splits smaller only for the reduce phase of a job, so that I can use all the machines for the reduce phase?

1
How many key groups do your mappers produce? – Binary Nerd
@BinaryNerd How can I know that? – user4052054
Look at the counters for your job, it's a standard counter. You should also know what your key is and how granular you would expect it to be. For example, if you have one key, only one reducer will run. – Binary Nerd
@BinaryNerd INFO mapreduce.Job: Counters: 52 – user4052054
Right, it's telling you there are 52 different counters. Among them there should be a counter for Reduce input groups. – Binary Nerd

1 Answer

1
votes

Split size does not affect reduce parallelism. It only drives the number of mappers.

MapReduce requires you to specify the number of reducers to use. You can set the mapreduce.job.reduces property, which defaults to 1, or call Job.setNumReduceTasks(int tasks) (see the javadoc). In your case, you want to increase this number.
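For example, you can pass the property on the command line with the generic -D option (a sketch; the jar name, driver class, and paths are placeholders for your own job):

```shell
# Run the job with 8 reduce tasks instead of the default 1.
# myjob.jar, MyDriver, /input and /output are hypothetical placeholders.
hadoop jar myjob.jar MyDriver -D mapreduce.job.reduces=8 /input /output
```

Note that the generic -D option is only honored if your driver parses it, e.g. by implementing Tool and running through ToolRunner; otherwise set the property in code or call job.setNumReduceTasks(8) directly.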

Higher-level tools like Apache Crunch automatically set the number of reducers from the size of the input, a provided scale factor, and a target input size for each reducer. If hard-coding a number of tasks does not fit your needs, you can easily implement a similar strategy.
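Such a heuristic can be sketched in a few lines of plain Java (the class, method name, and defaults below are illustrative assumptions, not Crunch's actual API):

```java
// Sketch of a Crunch-style heuristic for choosing a reducer count.
// estimateReducers and its parameters are illustrative, not a real Hadoop API.
public class ReducerEstimator {

    /**
     * Estimate a reducer count from the total input size, a scale factor
     * (how much the data is expected to grow or shrink between map and
     * reduce), and a target number of input bytes per reducer.
     */
    static int estimateReducers(long inputBytes, double scaleFactor,
                                long bytesPerReducer) {
        long scaled = (long) Math.ceil(inputBytes * scaleFactor);
        int reducers = (int) Math.ceil((double) scaled / bytesPerReducer);
        return Math.max(1, reducers); // never ask for zero reducers
    }

    public static void main(String[] args) {
        // 200 MB of input, no growth, ~32 MB per reducer -> 7 reducers
        System.out.println(estimateReducers(200L << 20, 1.0, 32L << 20));
    }
}
```

The result would then be fed to Job.setNumReduceTasks(...) in the driver; for your 200 MB job, a target of ~32 MB per reducer would already spread the reduce work over several machines.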