I have a MapReduce job working with small amounts of data (200 MB). The map phase is computationally simple, but the reduce phase can be computationally expensive, taking much more time per input. Given a 32 MB split size, I see that in the map phase all machines are computing, but in the reduce phase only one is, and the reduce phase runs much more slowly. Is there a way to make splits smaller only for the reduce phase of a job so that I can use all the machines for the reduce phase?
1 Answer
Split size does not affect reduce parallelism. It only drives the number of mappers.
MapReduce requires you to specify the number of reducers to use. You can set the mapreduce.job.reduces
property, which defaults to 1, or call Job.setNumReduceTasks(int tasks)
(see the javadoc). Here, you want to increase this number.
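For example, you can set the property on the command line without recompiling, assuming your driver uses ToolRunner/GenericOptionsParser (the jar, class, and path names below are placeholders):

```shell
# Run the job with 8 reducers instead of the default 1
hadoop jar my-job.jar com.example.MyJobDriver \
  -D mapreduce.job.reduces=8 \
  /input /output
```

Equivalently, in the driver code itself: job.setNumReduceTasks(8). Note that a reduce task only helps if there are enough distinct keys to spread across the reducers.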
Higher-level tools like Apache Crunch automatically set the number of reducers from the size of the input, a provided scale factor, and a target input size for each reducer. If hard-coding a number of tasks does not fit your needs, you can easily implement a similar strategy.
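A minimal sketch of that strategy (the class, method, and parameter names are illustrative, not Crunch's actual API): estimate the reducer count as ceil(inputBytes × scaleFactor / targetBytesPerReducer), with a floor of 1.

```java
// Hypothetical helper that mimics the Crunch-style heuristic described above.
public class ReducerEstimator {
    // Returns at least 1 reducer; rounds up so no reducer exceeds the target size.
    static int estimateReducers(long inputBytes, double scaleFactor, long targetBytesPerReducer) {
        long scaled = (long) Math.ceil(inputBytes * scaleFactor);
        long reducers = (scaled + targetBytesPerReducer - 1) / targetBytesPerReducer; // ceiling division
        return (int) Math.max(1, reducers);
    }

    public static void main(String[] args) {
        // 200 MB of input, scale factor 1.0, 32 MB target per reducer -> 7 reducers
        System.out.println(estimateReducers(200L << 20, 1.0, 32L << 20));
    }
}
```

You would then pass the result to Job.setNumReduceTasks before submitting the job.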
Also check the "Reduce input groups" counter: reduce parallelism is capped by the number of distinct keys.
– Binary Nerd