MapReduce on Hadoop - sending data from the Mapper to the Reducer

Question

I'm trying to implement a MapReduce algorithm for a specific problem. Say that my Mapper has to handle a large Text object. My question is summarised in the following example: I have the Text object "Today is a lovely day" and I need to do some processing on its words. I have two options:

I can send to the Reducer key-value pairs of the form:
```
<1,Today>
<1,is>
<1,a>
<1,lovely>
<1,day>
```
I can send the key-value pair <1,Today is a lovely day> to the reducer and then process it, e.g. tokenise the String object.

Which approach is better here? In the first case I send more data to the Reducer, but there is no String object left to tokenise. In the second case the Mapper sends less data, but the Reducer has to tokenise the String itself.

Alex Alex · Accepted Answer · 2017-03-27T19:47:55

I don't think you will improve performance much by reducing traffic that way. What really matters here is that in the first case all your data is grouped by word before it reaches the reducer, which gives a completely different set of key-value pairs from the second option. I'm not sure you would be able to perform the same operations on them. Let's say you have:

<Today is a lovely day>
<Today is another lovely day>

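In the first case your reducer would operate on pairs grouped by word (assuming the key is the word rather than a number). They are shown here with each word's values summed and in the byte-wise order Hadoop uses for Text keys, which puts the capitalised Today first:

<Today, 2>

<a, 1>

<another, 1>

<day, 2>

<is, 2>

<lovely, 2>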

As you can see, the reducer input is grouped and sorted. In more advanced scenarios you apply your logic to the values of each group, for example finding a maximum or computing an average.
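To make the grouping concrete, here is a minimal plain-Python sketch of the first option. It is not Hadoop code: `map_words`, `shuffle` and `reduce_sum` are hypothetical stand-ins for the mapper, the shuffle phase and the reducer. The sort on UTF-8 bytes is meant to mimic the default ordering of Text keys, which is why the capitalised Today comes before the lowercase words.

```python
from collections import defaultdict


def map_words(sentence):
    """Mapper for option 1: emit one (word, 1) pair per token."""
    for word in sentence.split():
        yield word, 1


def shuffle(pairs):
    """Stand-in for the shuffle phase: group values by key, keys in byte order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items(), key=lambda kv: kv[0].encode("utf-8"))


def reduce_sum(key, values):
    """Reducer: every value for one word arrives together, so sum them."""
    return key, sum(values)


sentences = ["Today is a lovely day", "Today is another lovely day"]
mapped = [pair for s in sentences for pair in map_words(s)]
reducer_input = shuffle(mapped)
word_counts = [reduce_sum(key, values) for key, values in reducer_input]

for key, values in reducer_input:
    print(key, values)
```

Strictly speaking, the reducer receives each word with its list of values (for example `day` with `[1, 1]`) and the sum is computed inside the reducer. Swapping `sum` for `max` or an average is then a one-line change.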

In the second case your keys would be sentences:

<Today is a lovely day, 1>

<Today is another lovely day, 1>

So there is a chance that two different reducers would handle these two pairs. The operations you can perform on them would differ from the first case, because it is a different set of data. You would not be able to compute key-based maximums or averages of the values the way you can in the first case.
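A rough plain-Python sketch of the second option shows why. The `partition` function is a hypothetical stand-in for a hash partitioner: it uses `zlib.crc32` purely for illustration, not Hadoop's actual hash. `NUM_REDUCERS` is an assumed setting.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 2  # assumed number of reduce tasks, for illustration only


def partition(key, num_reducers=NUM_REDUCERS):
    """Stand-in hash partitioner (crc32 is an illustrative choice, not Hadoop's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_sentence(sentence):
    """Mapper for option 2: emit the whole sentence as the key."""
    yield sentence, 1


sentences = ["Today is a lovely day", "Today is another lovely day"]

# reducer index -> {key: [values]}
by_reducer = defaultdict(dict)
for s in sentences:
    for key, value in map_sentence(s):
        by_reducer[partition(key)].setdefault(key, []).append(value)

# Flatten to the groups the reducers would see, whichever reducer gets them.
groups = [
    (key, values)
    for reducer in sorted(by_reducer)
    for key, values in sorted(by_reducer[reducer].items())
]

for reducer in sorted(by_reducer):
    print("reducer", reducer, by_reducer[reducer])
```

Each sentence forms its own group with a single value, and which reducer receives it depends only on the hash of the whole sentence. No group is keyed by `day` or `lovely`, so a per-word sum, maximum or average has nothing to aggregate over.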
