9
votes

In a typical MapReduce setup (like Hadoop), how many reducers are used for one job, for example, counting words? My understanding of Google's MapReduce is that only 1 reducer is involved. Is that correct?

For example, the word count will divide the input into N chunks, and N map tasks will run, each producing a (word, count) list. My question is: once the map phase is done, will there be only ONE reducer instance running to compute the result, or will there be reducers running in parallel?

5
Your question is lacking some context. Is there a particular MapReduce framework you are referring to, i.e. Hadoop? And if so, are you asking how many reduce "tasks" will be associated with each map "task"? – diliop
The short answer is that there will be a configurable number of reducers (at least 1). – Spike Gronim

5 Answers

14
votes

The simple answer is that the number of reducers does not have to be 1, and yes, reducers can run in parallel. As I mentioned above, this number is either user-defined or derived by the framework.

To keep things in context, I will refer to Hadoop here so you have an idea of how things work. If you are using the streaming API in Hadoop (0.20.2), you have to explicitly define how many reducers you would like, since by default only 1 reduce task is launched. You do so by passing -D mapred.reduce.tasks=<number of reducers> on the command line. The Java API will try to derive the number of reducers you need, but again you can set it explicitly. In both cases, there is a hard cap on the number of reduce tasks that can run per node, set in your mapred-site.xml configuration file via mapred.tasktracker.reduce.tasks.maximum.
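For instance, a streaming word-count job with four reduce tasks might be launched like this. The jar path, input/output paths, and mapper/reducer scripts are placeholders for illustration; only the -D flag is the point:

```shell
# Launch a streaming job with 4 reduce tasks instead of the default 1.
# Paths and script names below are hypothetical.
hadoop jar hadoop-streaming.jar \
    -D mapred.reduce.tasks=4 \
    -input /wordcount/input \
    -output /wordcount/output \
    -mapper mapper.py \
    -reducer reducer.py
```

With 4 reduce tasks you will get 4 part files in /wordcount/output, one per reducer.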

On a more conceptual note, you can look at the post on the Hadoop wiki that discusses choosing the number of map and reduce tasks.

2
votes

In the case of the simple word-count example, it would make sense to use only one reducer.
If you want the result of the computation to be a single output, you have to use one reducer (2 or more reducers would give you 2 or more output files).

If this one reducer takes a long time to complete, you can think of chaining multiple reduce phases, where the reducers in the next phase sum the results from the previous reducers.
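The chaining idea can be sketched in plain Java outside Hadoop (the class and method names below are hypothetical, not Hadoop API): a single second-stage reducer merges the partial word counts written by several parallel first-stage reducers.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TwoStageSum {
    // Stage 2: a single reducer that merges the partial word counts
    // produced by several stage-1 reducers into one final map.
    static Map<String, Integer> mergePartials(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> p : partials)
            p.forEach((word, n) -> total.merge(word, n, Integer::sum));
        return total;
    }

    public static void main(String[] args) {
        // Pretend these three maps came from three parallel stage-1 reducers.
        List<Map<String, Integer>> partials = Arrays.asList(
                Map.of("the", 4, "cat", 1),
                Map.of("the", 2),
                Map.of("cat", 3, "mat", 1));
        System.out.println(mergePartials(partials));
        // prints {cat=4, mat=1, the=6}
    }
}
```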

1
votes

This depends entirely on the situation. In some cases you don't have any reducers at all: everything can be done map-side. In other cases you cannot avoid having one reducer, but generally that comes in a 2nd or 3rd map/reduce job that condenses earlier results. In general, though, you want to have many reducers, or else you are losing a lot of the power of MapReduce! In word count, for example, your mappers will emit (word, 1) pairs. These pairs are then partitioned by word, so that each reducer receives all occurrences of the same words and can compute the final sum for each of them. Each reducer then outputs its result. If you wanted a single output file, you could then run another M/R job that took all of these part files and concatenated them; that job would have only one reducer.
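The partition-then-sum flow described above can be simulated in plain Java, outside Hadoop. The class and helper names below are hypothetical; the hash-mod rule mirrors what Hadoop's default HashPartitioner does, and each map in the result stands in for one reducer's part file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountShuffle {
    // Route a word to one of numReducers reducers, like Hadoop's HashPartitioner.
    static int partition(String word, int numReducers) {
        return (word.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Each element of the returned list is one reducer's output: word -> total.
    static List<Map<String, Integer>> run(List<String> words, int numReducers) {
        List<Map<String, Integer>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) reducers.add(new TreeMap<>());
        for (String w : words) {
            // All (word, 1) pairs for the same word land on the same reducer,
            // which sums them; different reducers could run in parallel.
            reducers.get(partition(w, numReducers)).merge(w, 1, Integer::sum);
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("the", "cat", "sat", "the", "mat", "cat", "the");
        List<Map<String, Integer>> out = run(input, 3);
        for (int i = 0; i < out.size(); i++)
            System.out.println("part-" + i + ": " + out.get(i));
    }
}
```

Note that each word's total appears in exactly one of the part outputs, which is why a follow-up single-reducer job is needed only if you want everything in one file.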

1
votes

The default value is 1. If you are using Hive or Pig, the number of reducers depends on the query, e.g. group by, sum, etc.

In your own MapReduce code, it can be set by calling setNumReduceTasks on the Job/JobConf:

job.setNumReduceTasks(3);

Most of the time the number of reducers matters when you override getPartition(), i.e. when you are using a custom partitioner:

class customPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        if (numReduceTasks == 0)
            return 0;
        if (some logic)
            return 0;
        else if (some logic)
            return 1;
        else
            return 2;
    }
}

One thing you will notice is that the number of reducers equals the number of part files in the output: one part file per reducer.

Let me know if you have doubts.

0
votes

The reducers run in parallel. You can set the number of reducers in the mapred-site.xml config file, on the command line when submitting the job, or in the program itself, and that many reducers will run in parallel. It is not necessary to keep it at 1; by default its value is 1.