1 vote

I am using s3-dist-cp to copy 3,116,886 files (300 GB) from S3 to HDFS, and it took 4 days to copy just 1,048,576 files. I killed the job and need to understand how I can reduce this time, or what I am doing wrong.

s3-dist-cp --src s3://xml-prod/ --dest hdfs:///Output/XML/

This is on an AWS EMR cluster.

2
Well, I used a bigger EMR instance, m4.4xlarge. The S3 bucket and the EMR cluster were in the same region. - Priyanka O
I had the same observation as this post: stackoverflow.com/questions/38462480/… - Priyanka O

2 Answers

0 votes

The issue is HDFS's poor performance when dealing with lots of small files. Consider combining the files before putting them into HDFS; the --groupBy option of the s3-dist-cp command provides one way of doing that.
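For example, something along these lines (a sketch only; the regex and the 128 MiB target size are assumptions, so adjust the pattern to match how your XML files are actually named):

s3-dist-cp --src s3://xml-prod/ --dest hdfs:///Output/XML/ --groupBy '.*(\.xml)' --targetSize 128

Files whose names match the --groupBy pattern are concatenated as they are copied, and --targetSize splits the concatenated output into chunks of roughly that many MiB, so the destination ends up with a few thousand large files instead of millions of small ones.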

0 votes

Why not do the entire process as part of a single application pipeline? That way you don't have to store a lot of small intermediate files in HDFS.

S3 File Reader --> XML Parser --> Pick Required Fields --> Parquet Writer (single file with rotation policy)
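A minimal PySpark sketch of that pipeline, assuming the spark-xml package (com.databricks:spark-xml) is available on the cluster; the row tag, the column names, and the output path are hypothetical placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xml-to-parquet").getOrCreate()

# Read the XML straight from S3; no intermediate small files land in HDFS.
df = (spark.read.format("xml")
      .option("rowTag", "record")      # hypothetical row tag
      .load("s3://xml-prod/"))

# Pick only the required fields (placeholder column names).
picked = df.select("id", "name")

# Write a handful of large Parquet files instead of millions of small ones.
picked.coalesce(8).write.mode("overwrite").parquet("hdfs:///Output/Parquet/")

Coalescing to a small, fixed number of partitions stands in for the "single file with rotation policy" step; the right partition count depends on the 300 GB of input and the size of the cluster.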