2 votes

For one of my Hadoop jobs, the amount of data fed into my reducer tasks is extremely unbalanced. For instance, if I have 10 reducer tasks, the input to 9 of them will be in the 50KB range while the last is close to 200GB. I suspect that my mappers are generating a huge number of values for a single key, but I don't know which key that is. It's a legacy job and I no longer have access to the source code. Is there a way to see the key/value pairs, either as output from the mapper or input to the reducer, while the job is running?


2 Answers

1 vote

Try adding this to your CLI job run: -D mapred.reduce.tasks=0

This sets the number of reducers to 0, which in effect makes the mappers dump their output directly to HDFS. However, the job's own code may override the number of reducers regardless, so this might not work.

If this works, you will be able to inspect the mapper's output directly.
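For example, assuming the legacy job is launched with hadoop jar and parses generic options via ToolRunner (the jar name, main class, and paths below are placeholders), the -D flag has to come before the job's own arguments; on newer Hadoop versions the equivalent property is mapreduce.job.reduces:

```
hadoop jar legacy-job.jar com.example.LegacyJob \
    -D mapred.reduce.tasks=0 \
    /input/path /map/output/path
```

With zero reducers, each mapper writes its records straight to the output directory (as part-m-NNNNN files on Hadoop 2+), so you can look at the raw key/value pairs there.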

0 votes

You can always count the total number of values per key with a separate, simple MapReduce job.
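If you captured the mapper output with the map-only trick above, a word-count-style job over those files will reveal the hot key. Here is a minimal sketch, assuming the dumped records are text lines with the key and value separated by a tab (TextOutputFormat's default); the class names are placeholders:

```
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeyValueCounter {

    // Emits (key, 1) for every input line; the key is taken to be
    // everything before the first tab.
    public static class CountMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String s = line.toString();
            int tab = s.indexOf('\t');
            outKey.set(tab >= 0 ? s.substring(0, tab) : s);
            context.write(outKey, ONE);
        }
    }

    // Sums the counts per key; also used as a combiner to cut shuffle traffic.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) {
                total += c.get();
            }
            context.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "per-key value count");
        job.setJarByClass(KeyValueCounter.class);
        job.setMapperClass(CountMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Sort the result by the count column (for example with sort -k2 -n -r on the output files) and the skewed key will be at the top.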