hadoop cp vs streaming with /bin/cat as mapper and reducer

Question

I am new to Hadoop and have a very basic question about hadoop copy (cp) versus hadoop streaming when /bin/cat is used as both the mapper and the reducer.

hadoop -input -output -mapper /bin/cat -reducer /bin/cat

I believe the above command would copy the files. If so, how is it different from hadoop cp? Please correct me if my understanding is wrong.

Ashrith Ashrith · Accepted Answer · 2014-11-11T23:17:57

They do roughly the same thing, but in different ways:

hadoop cp (run as hadoop fs -cp) simply invokes the Java HDFS API and copies the files to the specified location, which is much faster than the streaming approach.
hadoop streaming, on the other hand (see the example command below), kicks off a MapReduce job. Like any other MapReduce job, it has to go through the map -> sort & shuffle -> reduce phases, which can take a long time depending on the size of your input dataset. Because of the default sort & shuffle phase, your input data also ends up sorted in the output directory.
```
hadoop jar /path/to/hadoop-streaming.jar \
  -input /input/path \
  -output /output/path \
  -mapper /bin/cat \
  -reducer /bin/cat
```
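To see the difference without a cluster, the two approaches can be roughly imitated on a local file. This is only a sketch: `sort` stands in for the sort & shuffle phase, and the sample file and the variable names are made up for illustration. A real streaming job sorts by key within each reducer, so with more than one reducer the output is spread over several part files, each sorted on its own.

```shell
#!/bin/sh
# Local sketch only: no Hadoop involved. `sort` is an assumed stand-in
# for the sort & shuffle phase between the map and reduce steps.
workdir=$(mktemp -d)

# Hypothetical sample input, deliberately unsorted.
printf 'banana\napple\ncherry\n' > "$workdir/input.txt"

# Analogue of a plain copy: the file is duplicated, line order is preserved.
cp "$workdir/input.txt" "$workdir/copy.txt"

# Analogue of streaming with cat as mapper and reducer:
# map (cat) -> sort & shuffle (sort) -> reduce (cat).
cat "$workdir/input.txt" | LC_ALL=C sort | cat > "$workdir/streamed.txt"

copy_result=$(cat "$workdir/copy.txt")
streamed_result=$(cat "$workdir/streamed.txt")

echo "copy:"
echo "$copy_result"
echo "streamed:"
echo "$streamed_result"

# Clean up the temporary directory.
rm -rf "$workdir"
```

The copied file keeps the original line order, while the "streamed" one comes out sorted, which mirrors what the streaming job does to the data on top of the extra work of running a full MapReduce job.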