1 vote

I am using s3-dist-cp to copy 3,116,886 files (300 GB) from S3 to HDFS, and it took 4 days to copy just 1,048,576 files. I killed the job and need to understand how I can reduce this time, or what I am doing wrong.

s3-dist-cp --src s3://xml-prod/ --dest hdfs:///Output/XML/

This is on an AWS EMR cluster.

2
Well, I used a bigger EMR instance, m4.4xlarge. The S3 bucket and the EMR cluster were in the same region. - Priyanka O
I had the same observation as this post: stackoverflow.com/questions/38462480/… - Priyanka O

2 Answers

0 votes

The issue is HDFS's poor performance when dealing with lots of small files. Consider combining the files before putting them into HDFS; the --groupBy option of the s3-dist-cp command provides one way of doing that.
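For example, something along these lines (a sketch only; the regex and the 128 MiB target size are assumptions, so adjust the pattern to match how your XML files are actually named):

s3-dist-cp --src s3://xml-prod/ --dest hdfs:///Output/XML/ --groupBy '.*(\.xml)' --targetSize 128

Files whose names match the --groupBy pattern are concatenated as they are copied, and --targetSize splits the concatenated output into chunks of roughly that many MiB, so the destination ends up with a few thousand large files instead of millions of small ones.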

0 votes

Why not do the entire process as part of a single application pipeline? That way you don't have to store a lot of small intermediate files in HDFS.

S3 File Reader --> XML Parser --> Pick Required Fields --> Parquet Writer (single file with rotation policy)
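A minimal PySpark sketch of that pipeline, assuming the spark-xml package (com.databricks:spark-xml) is available on the cluster; the row tag, the column names, and the output path are hypothetical placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xml-to-parquet").getOrCreate()

# Read the XML straight from S3; no intermediate small files land in HDFS.
df = (spark.read.format("xml")
      .option("rowTag", "record")      # hypothetical row tag
      .load("s3://xml-prod/"))

# Pick only the required fields (placeholder column names).
picked = df.select("id", "name")

# Write a handful of large Parquet files instead of millions of small ones.
picked.coalesce(8).write.mode("overwrite").parquet("hdfs:///Output/Parquet/")

Coalescing to a small, fixed number of partitions stands in for the "single file with rotation policy" step; the right partition count depends on the 300 GB of input and the size of the cluster.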