3 votes

I'm fairly new to Hadoop Map/Reduce. I'm trying to write a Map/Reduce job to find the average time taken by n processes, given an input text file like the one below:

ProcessName Time
process1    10
process2    20
processn    30

I went through a few tutorials, but I'm still not able to get a thorough understanding. What should my mapper and reducer classes do for this problem? Will my output always be a text file, or is it possible to store the average directly in some sort of variable?

Thanks.


2 Answers

3 votes

Your mapper maps your inputs to the value that you want to take the average of. So let's say that your input is a text file formatted like

ProcessName Time
process1    10
process2    20
.
.
.

Then you would need to take each line in your file, split it, grab the second column, and output the value of that column as an IntWritable (or some other Writable numeric type). Since you want to take the average of all times, not grouped by process name or anything, you will have a single fixed key. Thus, your mapper would look something like

private final IntWritable one = new IntWritable(1);
private final IntWritable output = new IntWritable();

@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Text has no split(); convert to String first, then split on the tab
    String[] fields = value.toString().split("\t");
    // Skip the header row and any line whose second column isn't a number
    if (fields.length < 2 || !fields[1].matches("\\d+")) return;
    output.set(Integer.parseInt(fields[1]));
    // A single constant key sends every time value to the same reduce call
    context.write(one, output);
}

Your reducer takes these values and simply computes the average. It would look something like

private final DoubleWritable average = new DoubleWritable();

@Override
protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    int count = 0;
    for (IntWritable value : values) {
        sum += value.get();
        count++;
    }
    // Cast to double so the division isn't truncating integer division
    average.set(sum / (double) count);
    context.write(key, average);
}

I'm making a lot of assumptions here, about your input format and whatnot, but they are reasonable assumptions and you should be able to adapt this to suit your exact needs.
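For completeness, here is a minimal driver sketch showing how the two pieces might be wired together. The class names (AverageTimeJob, TimeMapper, AverageReducer) are hypothetical placeholders for whatever you call the classes above. Note that the map output types (IntWritable, IntWritable) differ from the final output types (IntWritable, DoubleWritable), so both must be declared:

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "average time");
job.setJarByClass(AverageTimeJob.class);      // hypothetical driver class
job.setMapperClass(TimeMapper.class);         // hypothetical name for the mapper above
job.setReducerClass(AverageReducer.class);    // hypothetical name for the reducer above
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(DoubleWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);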

Will my output always be a text file, or is it possible to store the average directly in some sort of variable?

You have a couple of options here. You can post-process the output of the job (written to a single file), or, since you're computing a single value, you can store the result in counters, as sketched below.
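If you go the counter route, keep in mind that Hadoop counters only hold long values, so rather than storing the average itself you would accumulate the sum and the count in the tasks and divide in the driver. A rough sketch, assuming a hypothetical AvgCounters enum:

// Hypothetical enum shared by the tasks and the driver
enum AvgCounters { TOTAL_TIME, NUM_PROCESSES }

// Inside map(), where time is the long parsed from the current line:
context.getCounter(AvgCounters.TOTAL_TIME).increment(time);
context.getCounter(AvgCounters.NUM_PROCESSES).increment(1);

// In the driver, after job.waitForCompletion(true) returns:
long total = job.getCounters().findCounter(AvgCounters.TOTAL_TIME).getValue();
long count = job.getCounters().findCounter(AvgCounters.NUM_PROCESSES).getValue();
double average = (double) total / count;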

3 votes

Your mappers read the text file and apply the following map function to every line:

map: (key, value)
  time = value[2]
  emit("1", time)

All map calls emit the key "1", which will be processed by a single reduce function:

reduce: (key, values)
  result = sum(values) / n
  emit("1", result)

Since you're using Hadoop, you have probably seen StringTokenizer used in map functions; you can use it to pull just the time out of a line, as in the sketch below. You also need some way to compute n (the number of processes); you could, for example, use a counter in another job that just counts lines.
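For illustration (not the poster's exact code), extracting the time with StringTokenizer could look like this, assuming whitespace-separated name and time columns:

// value is the Text passed to map(); StringTokenizer splits on whitespace by default
StringTokenizer tokenizer = new StringTokenizer(value.toString());
String name = tokenizer.nextToken();               // e.g. "process1"
long time = Long.parseLong(tokenizer.nextToken()); // e.g. 10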

Update
If you executed this job as described, a tuple would have to be sent to the reducer for every single line, potentially clogging the network if you run the Hadoop cluster on multiple machines. A cleverer approach computes the sum of the times closer to the inputs, e.g. by specifying a combiner:

combine: (key, values)
  emit(key, sum(values))

This combiner is then executed on the results of all map functions on the same machine, i.e., without any networking in between. The reducer then gets only about as many tuples as there are machines in the cluster, rather than as many as there are lines in your input files.
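As a rough Java sketch of that combiner (the Text/LongWritable types are assumptions matching the pseudocode, not the poster's code), it pre-sums the times produced by the mappers on one node:

public static class SumCombiner
        extends Reducer<Text, LongWritable, Text, LongWritable> {
    private final LongWritable sum = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable value : values) {
            total += value.get();
        }
        sum.set(total);
        // Input and output types must match the map output, since the
        // combiner's output is fed to the real reducer
        context.write(key, sum);
    }
}

You would register it in the driver with job.setCombinerClass(SumCombiner.class); note that a combiner must be safe to run zero, one, or many times, which holds here because summing is associative.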