4
votes

What is the significance of data being passed as key/value pairs to the mapper also in the Hadoop Map Reduce framework. I understand that key/value pairs hold significance when they are passed to the reducers as they cater to the partitioning of data coming from the mappers. Values belonging to the same key go as a list from the mapper to the reducer stage. But how are the keys used before the mapper stage itself? What happens to values belonging to the same key? If we don't define a custom input format, I presume Hadoop takes in the record number from the input file as the key and the text line as the value in the mapper function. But in case we decide to implement a custom input format there is a custom selection of the keys and there could be a possibility where we have values corresponding to the same key.

How does phenomenon get handled in the mapper stage? Does the mapper ignore duplicate records and treats them as separate records or does it only choose one record per key?

1

1 Answers

5
votes

An input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record—a key-value pair—in turn.

So mapper treats records with same key as separate records.