0 votes

I know that Apache Hadoop provides distcp to copy files from AWS S3 to HDFS, but it does not seem very efficient, and its logging is inflexible.

In my project, we are required to write a log entry in our own customized format after each file transfer to HDFS succeeds or fails. Given the large volume of data to load, the most efficient approach is to load the AWS data into the HDFS cluster with Hadoop MapReduce, so I intend to write a Hadoop MapReduce job similar to distcp.

My plan is to have each Mapper load one S3 directory using the AWS Java SDK, since there are many S3 directories to be loaded into HDFS. Could anyone give some suggestions on how to achieve this goal? Thanks in advance!
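To make the question concrete, here is a minimal sketch of the kind of map-only job described above: a text input file lists one "bucket/prefix" per line, NLineInputFormat gives each map task a single line, and the mapper copies every object under that prefix into HDFS with the AWS SDK for Java (v1), logging one line per file. The input-line format, the /data/ target root, and the COPY_OK/COPY_FAIL log format are all placeholders of mine, not anything distcp itself does.

```java
import java.io.IOException;
import java.io.InputStream;

import com.amazonaws.AmazonClientException;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3ObjectSummary;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class S3ToHdfsJob {

  // Map-only job: each map task receives one "bucket/prefix" line and copies
  // every object under that prefix into HDFS, logging one line per file.
  public static class CopyPrefixMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private AmazonS3 s3;
    private FileSystem hdfs;

    @Override
    protected void setup(Context context) throws IOException {
      s3 = AmazonS3ClientBuilder.defaultClient();       // credentials from env/instance profile
      hdfs = FileSystem.get(context.getConfiguration());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Input line format (an assumption): "my-bucket/some/prefix/"
      String[] parts = value.toString().trim().split("/", 2);
      String bucket = parts[0];
      String prefix = parts.length > 1 ? parts[1] : "";

      ListObjectsV2Request req =
          new ListObjectsV2Request().withBucketName(bucket).withPrefix(prefix);
      ListObjectsV2Result page;
      do {
        page = s3.listObjectsV2(req);
        for (S3ObjectSummary obj : page.getObjectSummaries()) {
          if (obj.getKey().endsWith("/")) {
            continue;                                   // skip "directory" marker objects
          }
          Path target = new Path("/data/" + bucket + "/" + obj.getKey());
          try {
            InputStream in = s3.getObject(bucket, obj.getKey()).getObjectContent();
            IOUtils.copyBytes(in, hdfs.create(target, true), 4096, true);
            // Custom-format success log line; the format itself is a placeholder.
            System.out.println("COPY_OK\t" + bucket + "/" + obj.getKey() + "\t" + target);
            context.getCounter("s3copy", "succeeded").increment(1);
          } catch (IOException | AmazonClientException e) {
            System.out.println("COPY_FAIL\t" + bucket + "/" + obj.getKey() + "\t" + e.getMessage());
            context.getCounter("s3copy", "failed").increment(1);
          }
        }
        req.setContinuationToken(page.getNextContinuationToken());
      } while (page.isTruncated());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt(NLineInputFormat.LINES_PER_MAP, 1);      // one S3 prefix per map task
    Job job = Job.getInstance(conf, "s3-to-hdfs");
    job.setJarByClass(S3ToHdfsJob.class);
    job.setMapperClass(CopyPrefixMapper.class);
    job.setNumReduceTasks(0);
    job.setInputFormatClass(NLineInputFormat.class);
    job.setOutputFormatClass(NullOutputFormat.class);
    NLineInputFormat.addInputPath(job, new Path(args[0])); // file listing bucket/prefix lines
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

With this layout, parallelism is simply the number of prefix lines in the input file, and the per-file success/failure lines land in each task's stdout log, where they can be collected in whatever format the project requires. Is this a reasonable structure, or is there a better way to distribute the S3 directories across mappers?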

1
Are you using AWS EMR? If so, have you tried reading directly from the S3 buckets? – filipebarretto

1 Answer

0 votes

Have you tried s3a? s3a is the successor to the original s3n filesystem: it removes some of its limitations (such as the file-size limit) and improves performance. Also, what exactly is the problem with distcp, and which filesystem are you using for S3 (s3n or s3a)? There has been a fair amount of recent work on distcp, so it may be worth checking the newest version.
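For example (assuming a recent Hadoop distribution with the hadoop-aws module on the classpath and S3 credentials configured), distcp can read straight from an s3a:// URI; the bucket and paths below are placeholders:

```
hadoop distcp s3a://my-bucket/source/path hdfs:///data/target/path
```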