0
votes

When I look at my Hadoop screen, I see stats like

Average Map Time    5mins, 56sec
Average Shuffle Time    6mins, 27sec
Average Merge Time  4mins, 25sec
Average Reduce Time 3mins, 51sec

From what I understand, MapReduce works something like

  1. Map step: Use "mapper" machines to apply some transformation to each line of input, which outputs a key-value pair for each line.
  2. Shuffle step: Take these key-value pairs, and group together pairs with the same key, assigning pairs with the same key to the same "reducer" machine.
  3. Reduce step: Apply a "reduce" transformation on all pairs with the same key, to produce one result for each group.

So I think I know what "map", "shuffle", and "reduce" are. But what is "merge?"

1
group together pairs that exactly involves merging ;) - Thomas Jungblut
@ThomasJungblut Wait, then what is "shuffle?" - Jessica
Shuffle consists of two parts: downloading map outputs and merging them into a single file that acts as the reduce input. I think they just splitted the metrics to show bottlenecks on deserialization vs. downloads, although these happen mostly in parallel so I don't know how helpful that metric is. - Thomas Jungblut
Merge phase in Map Reduce refers to clubbing the KeyValue Pairs which are given out by the Reducer. And Shuffle refers to directing the key-value pairs to the assigned processors. For an example, One processor handles only keys which have even values and the other processor handles odd values, then after Map or Combiner phase the keys are separated and even key-value pairs are directed to its respective processor and the odd-one to its respective. For Merge have a look on this slides.link - letsBeePolite

1 Answers

1
votes

Shuffle and merge overlap:

The metric is listed as "the time delta between the end of the shuffle and the start of the reduce"

You can see in these patch notes "The shuffle and merge phases overlap in practice, but really what we're looking for here is excessive time spent merging even after the data has been shuffled to the reducer."

So the steps happen together in the process but they are just calculating the metrics of the additional merge time needed.

Source: https://issues.apache.org/jira/browse/MAPREDUCE-5059