1 vote

I am trying to design a mapper and reducer for Hadoop. I am new to Hadoop, and I'm a bit confused about how the mapper and reducer are supposed to work for my specific application.

The input to my mapper is the connectivity of a large directed graph. It is a 2-column input where each row is an individual edge: the first column is the start node id and the second column is the end node id. I'm trying to output the number of neighbors for each start node id into a 2-column text file, where the first column is sorted in order of increasing start node id.

My questions are:

(1) The input is already set up such that each line is a key-value pair, where the key is the start node id, and the value is the end node id. Would the mapper simply just read in each line and write it out? That seems redundant.

(2) Does the sorting take place in between the mapper and reducer or could the sorting actually be done with the reducer itself?

It seems that Hadoop does the shuffling automatically for the user? So I think my question (2) no longer makes sense? – 24n8
I would recommend looking at Spark GraphX or JanusGraph rather than plain MapReduce on Hadoop – OneCricketeer
Thanks. Can you expound on why? Are those better for this application, or better overall? – 24n8
Both... JanusGraph is a database to hold your data, and graph queries can be run on it. Spark with GraphX is just a better processing engine than MapReduce for graph-like data. You can use the Spark GraphFrames package to load your initial data into a GraphX object – OneCricketeer
No... Spark reads data from HDFS and can run jobs on YARN. Just no one really writes MapReduce that much anymore – OneCricketeer

1 Answer

0 votes

If my understanding is correct, you want to count how many distinct values a key will have.

Simply emitting the input key-value pairs in the mapper and then counting the distinct values per key in the reducer (e.g., by adding them to a set and emitting the set size as the reducer's output value) is one way of doing it, but, as you say, a bit redundant.
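A minimal sketch of that plain approach, assuming whitespace-separated integer node ids and the newer org.apache.hadoop.mapreduce API (the class names here are my own):

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NeighborCount {

    // Mapper: parse "startNode endNode" and emit the pair unchanged.
    public static class EdgeMapper
            extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().trim().split("\\s+");
            if (parts.length == 2) {
                context.write(new IntWritable(Integer.parseInt(parts[0])),
                              new IntWritable(Integer.parseInt(parts[1])));
            }
        }
    }

    // Reducer: collect the distinct end nodes per start node and emit the set size.
    public static class DistinctCountReducer
            extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        @Override
        protected void reduce(IntWritable startNode, Iterable<IntWritable> endNodes,
                              Context context) throws IOException, InterruptedException {
            Set<Integer> distinct = new HashSet<>();
            for (IntWritable endNode : endNodes) {
                distinct.add(endNode.get());
            }
            context.write(startNode, new IntWritable(distinct.size()));
        }
    }
}
```

If the edge list is guaranteed to contain no duplicate edges, the reducer could simply count the values instead of building a set.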

In general, you want to reduce the network traffic, so you may want to do some more computation before the shuffle (which, yes, Hadoop does for you automatically).

Two easy ways to improve the efficiency are:

1) Use a combiner, which will output sets of values instead of single values. This way, you send fewer key-value pairs to the reducers, and some values can be skipped entirely, since they are already in the local value set for the same key (see the first sketch after this list).

2) Use map-side aggregation. Instead of emitting the input key-value pairs right away, store them locally in the mapper (in memory) in a data structure such as a hashmap or multimap, where the key is the map input key and the value is the set of values seen so far for that key. Each time you meet a new value for a key, you add it to this structure. At the end of each mapper, you emit the whole structure (or convert the values to an array) from the cleanup() method (close() in the old API); see the second sketch after this list.
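For 1), keep in mind that Hadoop requires a combiner's output types to match the map output types, so rather than emitting whole sets, a simple variant of the same idea is a combiner that just drops duplicate neighbors of a key locally. A sketch, reusing the types from the mapper above (the class name is mine):

```java
// De-duplicating combiner: same key/value types as the mapper output,
// so it can be plugged in with job.setCombinerClass(DedupCombiner.class).
public static class DedupCombiner
        extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    @Override
    protected void reduce(IntWritable startNode, Iterable<IntWritable> endNodes,
                          Context context) throws IOException, InterruptedException {
        Set<Integer> seen = new HashSet<>();
        for (IntWritable endNode : endNodes) {
            if (seen.add(endNode.get())) {           // true only the first time this neighbor appears
                context.write(startNode, endNode);   // forward each distinct neighbor once
            }
        }
    }
}
```

For 2), the mapper itself can do the buffering; in the newer API the flush hook is cleanup(). This sketch assumes the per-split map of keys and value sets fits in the mapper's heap, and needs java.util.HashMap and java.util.Map in addition to the imports above:

```java
// In-mapper (map-side) aggregation: buffer distinct neighbors per start node
// in memory and emit them once per input split, already de-duplicated.
public static class AggregatingEdgeMapper
        extends Mapper<LongWritable, Text, IntWritable, IntWritable> {

    private final Map<Integer, Set<Integer>> neighbors = new HashMap<>();

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        String[] parts = line.toString().trim().split("\\s+");
        if (parts.length == 2) {
            int start = Integer.parseInt(parts[0]);
            int end = Integer.parseInt(parts[1]);
            neighbors.computeIfAbsent(start, k -> new HashSet<>()).add(end);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Flush the buffered (startNode, neighbor) pairs once at the end of the split.
        for (Map.Entry<Integer, Set<Integer>> entry : neighbors.entrySet()) {
            IntWritable start = new IntWritable(entry.getKey());
            for (int end : entry.getValue()) {
                context.write(start, new IntWritable(end));
            }
        }
    }
}
```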

You can look up both methods using the keywords "combiner" and "map-side aggregation".

A global sort on the key is a bit trickier. Again, there are two basic options, though neither is really great: 1) use a single reducer, but then you gain nothing from parallelism in the reduce phase; 2) use a total order partitioner, which needs some extra coding.
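For completeness, a minimal driver sketch for the single-reducer option, wiring together the classes sketched above. Since the start node ids are emitted as IntWritable keys, the shuffle already sorts them numerically, so with one reducer the single output file comes out in increasing start node id order:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NeighborCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "neighbor count");
        job.setJarByClass(NeighborCountDriver.class);
        job.setMapperClass(NeighborCount.EdgeMapper.class);
        job.setCombinerClass(NeighborCount.DedupCombiner.class);  // optional, see above
        job.setReducerClass(NeighborCount.DistinctCountReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(1);  // single reducer -> one globally sorted output file
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```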

Other than that, you may want to move to Spark for a more intuitive and efficient solution.