0 votes

I have a 5 node Hadoop cluster in which 2 nodes are dedicated datanodes and also run a tasktracker.

I run my hadoop job like this:

sudo -u hdfs hadoop jar /tmp/MyHadoopJob2.jar com.abhi.MyHadoopJob2 -D mapred.reduce.tasks=2 /sample/cite75_99.txt /output3

The job runs successfully and I can see the correct output, but when I go to the portal

http://jt1.abhi.com:50030

I can see

[screenshot of the JobTracker portal showing the job's task counts]

So only 1 reduce task is being run.

The reason I am so particular about running multiple reduce tasks is that I want to confirm whether Hadoop will still create a perfectly sorted output file even when different instances of the reducer run on different machines.

Currently my output file is fully sorted, but that is only because a single reduce task is being run.

2
You will get two files with two reducers, not one. - Mike Park
Ah. So doesn't this create a devil-and-the-deep-sea problem? If we have 1 reducer, it will crash if the input is too large, but if we have multiple reducers, then we don't get a single output? - Knows Not Much
Right. Multiple outputs, because two machines can't write to the same file concurrently. You can still achieve a single sorted file by concatenating the files. - Mike Park

2 Answers

1 vote

The number of output files is based on the number of reducers for your given job. You can still merge the multiple files into one file if your requirement demands it.

To merge them, use the hadoop shell command below:

hadoop fs -getmerge <src> <localdst>

src: HDFS output folder path
localdst: path of a single file on the local filesystem
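
For example, using the output path from the question (the local destination filename below is just an illustration):

hadoop fs -getmerge /output3 /tmp/output3-merged.txt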

Hope this clarifies your doubts.

1 vote

The reducer has 2 jobs:

1. to reduce the mapped key/value pairs
2. to combine the outputs of multiple mappers while doing so
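
As an illustration of the first point, a minimal reducer that sums the values for each key could look like the sketch below (class and type names are assumed for illustration; the question does not show MyHadoopJob2's source):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch only: sums all values emitted for each key.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get(); // combine every mapper's output for this key
        }
        context.write(key, new IntWritable(sum));
    }
}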

Since you have only 2 datanodes, only 2 mappers can run simultaneously, which allows only one possible reducer at any given moment.
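
One thing worth checking: the -D generic option in the question's command is only honored when the driver runs through ToolRunner; you can also hard-code the count with job.setNumReduceTasks. A minimal driver sketch under those assumptions (the real MyHadoopJob2 source is not shown, so the mapper/reducer wiring is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyHadoopJob2 extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already carries any -D overrides parsed by ToolRunner,
        // so -D mapred.reduce.tasks=2 takes effect here.
        Job job = new Job(getConf(), "MyHadoopJob2");
        job.setJarByClass(MyHadoopJob2.class);
        job.setNumReduceTasks(2); // or rely on the -D flag alone
        // setMapperClass/setReducerClass and output types omitted here.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options (-D, -files, ...) before
        // handing the remaining arguments to run().
        System.exit(ToolRunner.run(new Configuration(), new MyHadoopJob2(), args));
    }
}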