3
votes

It sounds like a simple job, but with MapReduce it doesn't seem that straightforward.

I have N files, each containing only one line of text. I'd like the Mapper to output key-value pairs like < filename, score >, in which 'score' is an integer calculated from the line of text. As a side note, I am using the snippet below to get the file name (hope it's correct).

 // Old API (org.apache.hadoop.mapred): the Reporter exposes the split being processed.
 FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
 String fileName = fileSplit.getPath().getName();

Assuming the mapper does its job correctly, it should output N key-value pairs. Now the problem is: how should I program the Reducer to output only the one key-value pair with the maximum 'score'?

From what I know, a Reducer only works on key-value pairs that share the same key. Since the outputs in this scenario all have different keys, I am guessing something should be done before the Reduce step. Or should the Reduce step perhaps be omitted altogether?


5 Answers

2
votes

You can use the setup() and cleanup() methods (configure() and close() in the old API). Declare a member variable in the reducer class that holds the maximum score seen so far. On each call to reduce(), compare the input value (score) with that variable and update it if the new score is larger; then emit the single maximum pair from cleanup().

setup() is called once before all reduce() invocations in a reduce task, and cleanup() is called once after the last reduce() invocation in that task. So, if you have multiple reducers, setup() and cleanup() are called separately in each reduce task, and each task would report only its own local maximum.
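The lifecycle described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name MaxScoreReducerSketch and its output list are hypothetical, and the methods merely mirror the names setup(), reduce() and cleanup() without using any Hadoop classes.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for a reducer (hypothetical class, no Hadoop types).
class MaxScoreReducerSketch {
    private String bestFile;   // file name with the highest score seen so far
    private int bestScore;     // highest score seen so far
    private final List<Map.Entry<String, Integer>> output = new ArrayList<>();

    // Called once before any reduce() call in this task.
    void setup() {
        bestFile = null;
        bestScore = Integer.MIN_VALUE;
    }

    // Called once per key; emits nothing, only updates the running maximum.
    void reduce(String fileName, Iterable<Integer> scores) {
        for (int score : scores) {
            if (bestFile == null || score > bestScore) {
                bestFile = fileName;
                bestScore = score;
            }
        }
    }

    // Called once after the last reduce() call; emits the single winner.
    void cleanup() {
        if (bestFile != null) {
            output.add(new SimpleEntry<>(bestFile, bestScore));
        }
    }

    List<Map.Entry<String, Integer>> getOutput() {
        return output;
    }
}
```

In a real job you would presumably also set the number of reduce tasks to one, so that the single task sees every key and its maximum is the global one.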

3
votes

Let's assume that

File1 has 10,123,23,233

File2 has 1,3,56,1234

File3 has 6,1,3435,678


Here is the approach for finding the maximum number from all the input files.

  1. Let's first do some random sampling (say, every Nth record). From File1 we get 123 and 10, from File2 56 and 1, and from File3 1 and 678.

  2. Pick the maximum number from the random sampling, which is 678.

  3. Pass the maximum from the sampling to the mappers. Each mapper ignores input numbers less than that maximum and emits the rest. Here, the mappers will ignore anything less than 678 and emit 678, 1234 and 3435.

  4. Configure the job to use one reducer, which finds the max of all the numbers sent to it. In this scenario the reducer will receive 678, 1234 and 3435 and will calculate the maximum to be 3435.
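The four steps can be simulated in plain Java to check the numbers. This is a rough sketch, not Hadoop code: the class SampledMaxSketch and its method names are made up for illustration, and the sample is passed in directly rather than drawn at random.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical simulation of the sampling approach (no Hadoop types).
class SampledMaxSketch {
    // Step 2: the threshold is the largest value in the sample.
    static int threshold(List<Integer> sample) {
        return Collections.max(sample);
    }

    // Step 3 (map side): drop values below the threshold.
    // The comparison is >= so the sampled maximum itself is still emitted.
    static List<Integer> mapFilter(List<Integer> fileValues, int threshold) {
        List<Integer> emitted = new ArrayList<>();
        for (int value : fileValues) {
            if (value >= threshold) {
                emitted.add(value);
            }
        }
        return emitted;
    }

    // Step 4 (single reducer): take the max of everything the mappers emitted.
    // Assumes the sample was drawn from the files, so at least one value survives.
    static int findMax(List<List<Integer>> files, List<Integer> sample) {
        int t = threshold(sample);
        List<Integer> reducerInput = new ArrayList<>();
        for (List<Integer> file : files) {
            reducerInput.addAll(mapFilter(file, t));
        }
        return Collections.max(reducerInput);
    }
}
```

With the three files above and the sample 123, 10, 56, 1, 1, 678, the threshold is 678, the reducer input is 1234, 3435 and 678, and findMax returns 3435.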


Some observations of the above approach

  1. The data has to be read twice (once for the sampling and once for the actual job).

  2. The data transferred between the mappers and reducers is decreased.

  3. The data processed by the reducers also decreases.

  4. The better the input sampling, the faster the job completes.

  5. A combiner with the same functionality as the reducer will further improve the job time.
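The combiner observation works because max is associative and commutative, so taking the max of per-mapper maxima gives the same answer as taking the max of all values. A small plain-Java sketch of that idea follows; the class CombinerSketch is hypothetical and does not use Hadoop's combiner API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of reusing the reduce function as a combiner.
class CombinerSketch {
    // The same function serves as both combiner and reducer.
    static int max(List<Integer> values) {
        return Collections.max(values);
    }

    // Combiner step: collapse each mapper's output to its local maximum.
    static List<Integer> combine(List<List<Integer>> perMapperOutputs) {
        List<Integer> combined = new ArrayList<>();
        for (List<Integer> mapperOutput : perMapperOutputs) {
            if (!mapperOutput.isEmpty()) {
                combined.add(max(mapperOutput));
            }
        }
        return combined;
    }

    // Reducer step: assumes at least one mapper emitted a value.
    static int reduce(List<Integer> combined) {
        return max(combined);
    }
}
```

For the three example files, the reducer would then receive three numbers (one per mapper) instead of all twelve.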

0
votes

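You can return the filename and the score together as the value, and return any constant as the key from your mapper. All pairs then share one key and arrive in a single reduce() call, where you can pick the maximum.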
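A minimal plain-Java sketch of the constant-key idea, assuming a hypothetical key "max" and a hypothetical FileScore value type (neither is a Hadoop class); the shuffle is simulated with a HashMap:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of the constant-key approach (no Hadoop types).
class ConstantKeySketch {
    static final String CONSTANT_KEY = "max"; // any fixed key would do

    // Value type carrying both the file name and its score.
    static final class FileScore {
        final String fileName;
        final int score;

        FileScore(String fileName, int score) {
            this.fileName = fileName;
            this.score = score;
        }
    }

    // Map + simulated shuffle: every pair is emitted under the same key,
    // so grouping by key yields exactly one group.
    static Map<String, List<FileScore>> mapAndShuffle(Map<String, Integer> scoresByFile) {
        Map<String, List<FileScore>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> e : scoresByFile.entrySet()) {
            grouped.computeIfAbsent(CONSTANT_KEY, k -> new ArrayList<>())
                   .add(new FileScore(e.getKey(), e.getValue()));
        }
        return grouped;
    }

    // Reduce: one call sees all values and keeps the highest-scoring one.
    static FileScore reduce(Iterable<FileScore> values) {
        FileScore best = null;
        for (FileScore v : values) {
            if (best == null || v.score > best.score) {
                best = v;
            }
        }
        return best;
    }
}
```

The trade-off is that all N values go through one reduce call, which is fine for one small record per file.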

0
votes

Refer to slides 32 and 33 of http://www.slideshare.net/josem.alvarez/map-reduceintro

I used the same approach and got the result. The only concern is that when you have multiple fields, you need to create fieldnamemin and fieldnamemax for each field individually.

0
votes

Omit the Reducer! Use the Configuration to set a global variable holding the score and key, then access it in the mapper to do a simple selection of the max score, using the global variable as the memory of the max score and its key. It should be simple, I guess.