3
votes

By default, if a mapper/reducer task fails, Hadoop retries it with another attempt, and if it fails 4 times (the default value) Hadoop marks the complete MR job as failed.

I am processing some raw data, and I am OK if the MR job fails to process 30% of it. Is there any configuration by which I can say: if up to 30% of the mappers fail, don't kill the job, and give the output from the remaining 70% of the data? I can handle exceptions in my code and keep failed and successful record counts in counters, but I want to know whether there is any such config in Hadoop.

1
Skipping Bad Records may be helpful for you. But currently there is no such threshold. – zsxwing
Another solution would be to use your own counters to track the percentage of failures: wrap the record processing in a try-catch block and increment the failure counter in the catch block. – Arijit Banerjee
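A minimal sketch of that counter-based approach, assuming plain text input, an illustrative CSV parsing step, and a hypothetical `Records` counter enum (none of these names are from the original post):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper that swallows per-record failures and tracks them in counters
// instead of letting one bad record fail the whole task.
public class TolerantMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Hypothetical counter names; pick whatever fits your job.
    enum Records { SUCCESS, FAILED }

    private final IntWritable one = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // Illustrative parsing step: assume the first field of a CSV line is the key.
            String[] fields = value.toString().split(",");
            outKey.set(fields[0]);
            context.write(outKey, one);
            context.getCounter(Records.SUCCESS).increment(1);
        } catch (Exception e) {
            // Bad record: count it and move on rather than failing the task.
            context.getCounter(Records.FAILED).increment(1);
        }
    }
}
```

After the job finishes, the driver can read these counters via `job.getCounters()` and decide whether enough of the data was processed successfully.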

1 Answer

4
votes

Thanks! I got the answer from Hadoop: The Definitive Guide.

For some applications, it is undesirable to abort the job if a few tasks fail, as it may be possible to use the results of the job despite some failures. In this case, the maximum percentage of tasks that are allowed to fail without triggering job failure can be set for the job. Map tasks and reduce tasks are controlled independently, using the mapreduce.map.failures.maxpercent and mapreduce.reduce.failures.maxpercent properties.
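A minimal driver sketch showing how that threshold could be set for the 30% case from the question. The class names and paths are hypothetical, and the property names are the ones quoted above from the Definitive Guide; exact names have changed between Hadoop versions, so verify against your distribution:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TolerantJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Allow up to 30% of map and reduce tasks to fail without failing the job.
        // Property names as quoted from the Definitive Guide; check your Hadoop version.
        conf.setInt("mapreduce.map.failures.maxpercent", 30);
        conf.setInt("mapreduce.reduce.failures.maxpercent", 30);

        Job job = Job.getInstance(conf, "failure-tolerant-job");
        job.setJarByClass(TolerantJobDriver.class);
        job.setMapperClass(TolerantMapper.class); // hypothetical mapper from the sketch above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```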