0
votes

This stage is a join between table A (100k rows) and B (5 million rows) on a key.

Table A has only two columns, with id as the join key.

I have tried a lot of things to convert this stage to a map join instead of a common join, but it still runs as a common join and takes a long time. Any suggestions for speeding it up?

Also, why does the reduce always reach 67% so quickly and then advance one percent at a time, taking a long time?

2015-12-21 01:12:55,635 Stage-2 map = 0%,  reduce = 0%
2015-12-21 01:13:39,342 Stage-2 map = 20%,  reduce = 0%, Cumulative CPU 5.49 sec
2015-12-21 01:13:43,618 Stage-2 map = 40%,  reduce = 0%, Cumulative CPU 31.79 sec
2015-12-21 01:13:45,692 Stage-2 map = 60%,  reduce = 0%, Cumulative CPU 34.42 sec
2015-12-21 01:13:46,735 Stage-2 map = 73%,  reduce = 0%, Cumulative CPU 45.1 sec
2015-12-21 01:13:48,812 Stage-2 map = 80%,  reduce = 0%, Cumulative CPU 46.87 sec
2015-12-21 01:13:57,125 Stage-2 map = 93%,  reduce = 0%, Cumulative CPU 60.03 sec
2015-12-21 01:13:58,160 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 61.46 sec
2015-12-21 01:14:42,001 Stage-2 map = 100%,  reduce = 67%, Cumulative CPU 72.34 sec
2015-12-21 01:15:42,196 Stage-2 map = 100%,  reduce = 67%, Cumulative CPU 141.27 sec
2015-12-21 01:16:31,357 Stage-2 map = 100%,  reduce = 68%, Cumulative CPU 183.86 sec
2015-12-21 01:17:31,587 Stage-2 map = 100%,  reduce = 68%, Cumulative CPU 245.5 sec
2015-12-21 01:18:31,840 Stage-2 map = 100%,  reduce = 68%, Cumulative CPU 306.58 sec
2015-12-21 01:19:32,275 Stage-2 map = 100%,  reduce = 68%, Cumulative CPU 371.49 sec
2015-12-21 01:20:32,549 Stage-2 map = 100%,  reduce = 68%, Cumulative CPU 433.61 sec
2015-12-21 01:20:58,591 Stage-2 map = 100%,  reduce = 69%, Cumulative CPU 457.46 sec
2015-12-21 01:21:58,904 Stage-2 map = 100%,  reduce = 69%, Cumulative CPU 516.95 sec
2015-12-21 01:22:59,143 Stage-2 map = 100%,  reduce = 69%, Cumulative CPU 576.51 sec
2015-12-21 01:23:59,480 Stage-2 map = 100%,  reduce = 69%, Cumulative CPU 636.39 sec
2015-12-21 01:24:59,810 Stage-2 map = 100%,  reduce = 69%, Cumulative CPU 692.75 sec
2015-12-21 01:25:59,978 Stage-2 map = 100%,  reduce = 69%, Cumulative CPU 757.39 sec

2 Answers

1
votes

Your reducers are progressing slowly step by step and taking time to complete.

The reduce percentage in a MapReduce job covers three phases: copy (shuffle), sort, and reduce.

Each of these phases contributes about 33.33% of the reported reduce progress. Here the first two phases, copying and sorting the map output, complete quickly. That is why you see the reducer jump to 67%. The rest of the progress depends on the reduce function itself, which is where the reduce-side join runs, and that join is what is taking the time.
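
Since the time is going into the reduce-side join, it may be worth checking that map join conversion is actually enabled and that the small table fits under the size thresholds. The following is only a sketch: the table names a and b and the column val are hypothetical stand-ins for the tables in the question, and the byte values are illustrative rather than recommended.

```sql
-- Let Hive convert a common join into a map join automatically.
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;

-- Size thresholds in bytes for the small table (illustrative values).
set hive.mapjoin.smalltable.filesize=50000000;
set hive.auto.convert.join.noconditionaltask.size=50000000;

-- Only needed if you rely on the explicit hint below instead of auto conversion.
set hive.ignore.mapjoin.hint=false;

-- Hypothetical query: a is the 100k-row table, b is the 5-million-row table.
EXPLAIN
SELECT /*+ MAPJOIN(a) */ b.*, a.val
FROM b
JOIN a ON b.id = a.id;
```

If the conversion worked, the EXPLAIN output should show a Map Join Operator rather than a reduce-side Join Operator for this stage.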

0
votes

You can set the number of reducers with set mapreduce.job.reduces=<number_of_reducers>. Start with 4 and see whether performance improves. If it does not speed up, paste the complete log.
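
As a rough sketch of how this looks in a Hive session (the table names a and b and the column val are hypothetical, and 4 is just the starting value suggested above):

```sql
-- Fix the reducer count for the statements that follow in this session.
set mapreduce.job.reduces=4;

-- Alternative: leave the count at -1 and let Hive derive it from input size.
-- set mapreduce.job.reduces=-1;
-- set hive.exec.reducers.bytes.per.reducer=67108864;

-- Hypothetical join between the small table a and the large table b.
SELECT b.*, a.val
FROM b
JOIN a ON b.id = a.id;
```

Keep in mind that if most rows share only a few key values, a single reducer still receives most of the data, so adding reducers may not help much in that case.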

Also give some details about the configuration of your cluster: is it single-node or multi-node, and if multi-node, how many nodes does it have?