I know that Apache Hadoop provides DistCp
to copy files from AWS S3 to HDFS, but it does not seem very efficient, and its logging is inflexible.
My project requires writing a log entry in our customized format after each file transfer to HDFS succeeds or fails. Given the large volume of data to load, running the copy as a Hadoop MapReduce job across the cluster should be the most efficient approach, so I am planning to write a MapReduce job similar to DistCp.
Since there are many S3 directories to load into HDFS, my plan is to have each Mapper on each node load one S3 directory using the AWS Java SDK. Could anyone give some suggestions on how to achieve this? Thanks in advance!
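To make the question concrete, here is a rough sketch of the Mapper I have in mind. It assumes `hadoop-client` and `aws-java-sdk-s3` (SDK v1) are on the classpath, that the driver uses `NLineInputFormat` so each mapper receives exactly one line of an input list, and that each line has the hypothetical format `bucket prefix destDir` (the class name, input format, and per-file SUCCESS/FAILED output are all my own assumptions, not an established API):

```java
import java.io.IOException;
import java.io.InputStream;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3ObjectSummary;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: each mapper gets one input line describing one S3 directory,
// lists every object under that prefix, copies it to HDFS, and emits a
// per-file status record that a reducer (or the output itself) can turn
// into our customized log format.
public class S3DirToHdfsMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private AmazonS3 s3;
    private FileSystem hdfs;

    @Override
    protected void setup(Context ctx) throws IOException {
        // Credentials come from the default provider chain
        // (environment variables or an EC2 instance profile).
        s3 = AmazonS3ClientBuilder.defaultClient();
        hdfs = FileSystem.get(ctx.getConfiguration());
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        // Hypothetical input-line format: "bucket prefix destDir"
        String[] parts = value.toString().trim().split("\\s+");
        String bucket = parts[0], prefix = parts[1], destDir = parts[2];

        ListObjectsV2Request req =
                new ListObjectsV2Request().withBucketName(bucket).withPrefix(prefix);
        ListObjectsV2Result res;
        do {
            // Page through all objects under the prefix.
            res = s3.listObjectsV2(req);
            for (S3ObjectSummary obj : res.getObjectSummaries()) {
                Path dest = new Path(destDir, obj.getKey().substring(prefix.length()));
                try (InputStream in =
                         s3.getObject(bucket, obj.getKey()).getObjectContent()) {
                    // copyBytes(..., true) closes both streams when done.
                    IOUtils.copyBytes(in, hdfs.create(dest), 4096, true);
                    ctx.write(new Text(obj.getKey()), new Text("SUCCESS"));
                } catch (IOException e) {
                    // Per-file failure record instead of failing the whole task.
                    ctx.write(new Text(obj.getKey()), new Text("FAILED: " + e.getMessage()));
                }
            }
            req.setContinuationToken(res.getNextContinuationToken());
        } while (res.isTruncated());
    }
}
```

The main open questions for me are whether a map-only job with `NLineInputFormat` is the right way to get one directory per mapper, and how to handle directories of very different sizes so one slow mapper does not stall the job.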