3 votes

I'm fairly new to Hadoop Map/Reduce. I'm trying to write a Map/Reduce job to find the average time taken by n processes, given an input text file like the one below:

ProcessName Time
process1    10
process2    20
processn    30

I went through a few tutorials, but I'm still not able to get a thorough understanding. What should my mapper and reducer classes do for this problem? Will my output always be a text file, or is it possible to store the average directly in some sort of variable?

Thanks.


2 Answers

3 votes

Your mapper maps your inputs to the value that you want to take the average of. So let's say that your input is a text file formatted like

ProcessName Time
process1    10
process2    20
.
.
.

Then you would need to take each line in your file, split it, grab the second column, and output the value of that column as an IntWritable (or some other Writable numeric type). Since you want to take the average of all times, not grouped by process name or anything, you will have a single fixed key. Thus, your mapper would look something like

private final IntWritable one = new IntWritable(1);
private final IntWritable output = new IntWritable();

@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Text has no split(); convert to String first, then split on the tab
    String[] fields = value.toString().split("\t");
    // Skip the header row and any line whose second column isn't a number
    if (fields.length < 2 || !fields[1].matches("\\d+")) return;
    output.set(Integer.parseInt(fields[1]));
    // A single constant key sends every time value to the same reduce call
    context.write(one, output);
}

Your reducer takes these values and simply computes the average. It would look something like

private final DoubleWritable average = new DoubleWritable();

@Override
protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    int count = 0;
    for (IntWritable value : values) {
        sum += value.get();
        count++;
    }
    // Cast to double so the division isn't truncating integer division
    average.set(sum / (double) count);
    context.write(key, average);
}

I'm making a lot of assumptions here, about your input format and whatnot, but they are reasonable assumptions and you should be able to adapt this to suit your exact needs.
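For completeness, here is a minimal driver sketch showing how the two pieces might be wired together. The class names (AverageTimeJob, TimeMapper, AverageReducer) are hypothetical placeholders for whatever you call the classes above. Note that the map output types (IntWritable, IntWritable) differ from the final output types (IntWritable, DoubleWritable), so both must be declared:

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "average time");
job.setJarByClass(AverageTimeJob.class);      // hypothetical driver class
job.setMapperClass(TimeMapper.class);         // hypothetical name for the mapper above
job.setReducerClass(AverageReducer.class);    // hypothetical name for the reducer above
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(DoubleWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);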

Will my output always be a text file, or is it possible to store the average directly in some sort of variable?

You have a couple of options here. You can post-process the output of the job (written to a single file), or, since you're computing a single value, you can store the result in counters, as sketched below.
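If you go the counter route, keep in mind that Hadoop counters only hold long values, so rather than storing the average itself you would accumulate the sum and the count in the tasks and divide in the driver. A rough sketch, assuming a hypothetical AvgCounters enum:

// Hypothetical enum shared by the tasks and the driver
enum AvgCounters { TOTAL_TIME, NUM_PROCESSES }

// Inside map(), where time is the long parsed from the current line:
context.getCounter(AvgCounters.TOTAL_TIME).increment(time);
context.getCounter(AvgCounters.NUM_PROCESSES).increment(1);

// In the driver, after job.waitForCompletion(true) returns:
long total = job.getCounters().findCounter(AvgCounters.TOTAL_TIME).getValue();
long count = job.getCounters().findCounter(AvgCounters.NUM_PROCESSES).getValue();
double average = (double) total / count;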

3 votes

Your mappers read the text file and apply the following map function to every line:

map: (key, value)
  time = value[2]
  emit("1", time)

All map calls emit the key "1", which will be processed by a single reduce function:

reduce: (key, values)
  result = sum(values) / n
  emit("1", result)

Since you're using Hadoop, you have probably seen StringTokenizer used in map functions; you can use it to pull just the time out of a line, as in the sketch below. You also need some way to compute n (the number of processes); you could, for example, use a counter in another job that just counts lines.
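For illustration (not the poster's exact code), extracting the time with StringTokenizer could look like this, assuming whitespace-separated name and time columns:

// value is the Text passed to map(); StringTokenizer splits on whitespace by default
StringTokenizer tokenizer = new StringTokenizer(value.toString());
String name = tokenizer.nextToken();               // e.g. "process1"
long time = Long.parseLong(tokenizer.nextToken()); // e.g. 10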

Update
If you executed this job as described, a tuple would have to be sent to the reducer for every single line, potentially clogging the network if you run the Hadoop cluster on multiple machines. A cleverer approach computes the sum of the times closer to the inputs, e.g. by specifying a combiner:

combine: (key, values)
  emit(key, sum(values))

This combiner is then executed on the results of all map functions on the same machine, i.e., without any networking in between. The reducer then gets only about as many tuples as there are machines in the cluster, rather than as many as there are lines in your input files.
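As a rough Java sketch of that combiner (the Text/LongWritable types are assumptions matching the pseudocode, not the poster's code), it pre-sums the times produced by the mappers on one node:

public static class SumCombiner
        extends Reducer<Text, LongWritable, Text, LongWritable> {
    private final LongWritable sum = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable value : values) {
            total += value.get();
        }
        sum.set(total);
        // Input and output types must match the map output, since the
        // combiner's output is fed to the real reducer
        context.write(key, sum);
    }
}

You would register it in the driver with job.setCombinerClass(SumCombiner.class); note that a combiner must be safe to run zero, one, or many times, which holds here because summing is associative.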