s3-dist-cp and hadoop distcp job infinitely looping in EMR

Question

I'm trying to copy 193 GB of data from S3 to HDFS. I'm running the following commands for s3-dist-cp and hadoop distcp:

s3-dist-cp --src s3a://PathToFile/file1 --dest hdfs:///user/hadoop/S3CopiedFiles/

hadoop distcp s3a://PathToFile/file1 hdfs:///user/hadoop/S3CopiedFiles/

I'm running these on the master node and monitoring how much data has been transferred. The copy takes about an hour, but once it finishes, everything gets erased, disk usage shows 99.8% on the 4 core instances in my cluster, and the Hadoop job runs forever. As soon as I run the command, this is printed:

16/07/18 18:43:55 INFO mapreduce.Job: map 0% reduce 0%
16/07/18 18:44:02 INFO mapreduce.Job: map 100% reduce 0%
16/07/18 18:44:08 INFO mapreduce.Job: map 100% reduce 14%
16/07/18 18:44:11 INFO mapreduce.Job: map 100% reduce 29%
16/07/18 18:44:13 INFO mapreduce.Job: map 100% reduce 86%
16/07/18 18:44:18 INFO mapreduce.Job: map 100% reduce 100%

This is printed immediately, and then the job copies data for about an hour. After that, it starts all over again:

16/07/18 19:52:45 INFO mapreduce.Job: map 0% reduce 0%
16/07/18 18:52:53 INFO mapreduce.Job: map 100% reduce 0%

Am I missing anything here? Any help is appreciated.

Also, where can I find the log files on the master node, so I can see whether the job is failing and therefore looping? Thanks.
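One possible way to check whether the job is failing and being retried is to query YARN from the master node. The sketch below is only an illustration: it assumes the `yarn` CLI is on the PATH and that YARN log aggregation is enabled on the cluster. The function names `list_jobs` and `fetch_job_logs` are made up for this example, and the application ID has to be taken from the listing.

```shell
#!/usr/bin/env bash
# Sketch: find a MapReduce job's YARN application and pull its logs.
# Assumes the `yarn` CLI is available and log aggregation is enabled.

list_jobs() {
  # List applications in every state, including FAILED and KILLED ones.
  yarn application -list -appStates ALL
}

fetch_job_logs() {
  # $1 = application ID copied from the output of list_jobs.
  local app_id="$1"
  if [ -z "$app_id" ]; then
    echo "usage: fetch_job_logs <applicationId>" >&2
    return 1
  fi
  yarn logs -applicationId "$app_id"
}
```

Running `list_jobs` first and then passing one of the listed IDs to `fetch_job_logs` should show whether repeated attempts of the same copy are ending in a failed state.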

Xuehua Jiang · Accepted Answer · 2017-05-26T05:16:38

In my case, I copied a single large compressed file from HDFS to S3, and hadoop distcp was much faster than s3-dist-cp.

When I checked the logs, the multipart upload in the reduce step took a very long time. Uploading one block (134 MB) took 20 seconds with s3-dist-cp, but only 4 seconds with hadoop distcp.
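To put those timings in perspective, here is a rough calculation of the per-block upload rate they imply. It is a back-of-the-envelope sketch only: it treats the block as exactly 134 MB and ignores any parallelism between uploads. The `throughput` helper is a name made up for this example.

```shell
#!/usr/bin/env bash
# Throughput implied by the timings quoted above (one 134 MB block per upload).
block_mb=134

throughput() {
  # $1 = block size in MB, $2 = seconds per block; prints MB/s to one decimal.
  awk -v mb="$1" -v s="$2" 'BEGIN { printf "%.1f", mb / s }'
}

s3distcp_rate=$(throughput "$block_mb" 20)
distcp_rate=$(throughput "$block_mb" 4)

echo "s3-dist-cp: ${s3distcp_rate} MB/s per block upload"
echo "distcp:     ${distcp_rate} MB/s per block upload"
```

That works out to about 6.7 MB/s against 33.5 MB/s, a fivefold gap per block.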

One difference between the two tools is that distcp creates its temp files on S3 (the destination file system), while s3-dist-cp creates its temp files on HDFS.

I am still investigating why multipart upload performance differs so much between distcp and s3-dist-cp. I hope someone with good insight can contribute here.
