
I am new to Hadoop and have a very basic question about hadoop copy (cp) versus hadoop streaming when /bin/cat is used as both the mapper and the reducer.

hadoop -input -output -mapper /bin/cat -reducer /bin/cat

I believe the above command would copy the files. If so, how is it different from hadoop cp? Please correct me if my understanding is wrong.

1 Answer


They do roughly the same thing, but in different ways:

  • hadoop cp (run as hadoop fs -cp) simply invokes the Java HDFS API and copies the files to the specified location, which is much faster than the streaming solution.
  • hadoop streaming, on the other hand (see the example command below), kicks off a MapReduce job. Like any other MapReduce job, it has to go through the map -> sort & shuffle -> reduce phases, which can take a long time depending on the size of your input dataset. Because of the default sort & shuffle phase, the data in the output directory also comes out sorted rather than in its original order.

    hadoop jar /path/to/hadoop-streaming.jar \
    -input /input/path \
    -output /output/path \
    -mapper /bin/cat \
    -reducer /bin/cat
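
To see the ordering difference without a cluster, here is a rough local sketch. It only imitates the two approaches with ordinary shell tools: plain cp stands in for the HDFS copy, and a cat | sort | cat pipeline stands in for the map -> sort & shuffle -> reduce phases. The sample file and its contents are made up for illustration, and a real streaming job may differ in other details (for example, output split across several part files).

```shell
#!/bin/sh
# Local simulation only: no Hadoop cluster is involved.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical sample input, deliberately unsorted.
printf 'pear\napple\nmango\n' > "$workdir/input.txt"

# Analogue of a plain copy: bytes and line order are preserved.
cp "$workdir/input.txt" "$workdir/copy.txt"

# Analogue of streaming with cat as mapper and reducer:
# map -> sort (stand-in for sort & shuffle) -> reduce.
cat "$workdir/input.txt" | LC_ALL=C sort | cat > "$workdir/streamed.txt"

copy_out=$(cat "$workdir/copy.txt")
streamed_out=$(cat "$workdir/streamed.txt")
printf 'copy:\n%s\n\nstreamed:\n%s\n' "$copy_out" "$streamed_out"
```

The copy keeps the lines in their original order, while the simulated streaming output comes back sorted, which is why the two approaches are not interchangeable when line order matters.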