We have a requirement to copy files, within a Spark job (running on a Hadoop cluster spun up by EMR), to the respective S3 bucket. At the moment we use the Hadoop FileSystem API (FileUtil.copy) to copy or move files between the two different file systems.
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
val config = spark.sparkContext.hadoopConfiguration
FileUtil.copy(sourceFileSystem, sourceFile, destinationFileSystem, targetLocation, true, config) // deleteSource = true, so this is effectively a move
This method works as required, but it is not efficient: it streams each file through the job, so the execution time grows with both the size of the files and the number of files to be copied.
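For reference, here is a hedged sketch (not our actual job) of one way to cut the wall-clock time: distribute the FileUtil.copy calls across the executors so many files are copied concurrently. Each file is still streamed, but the copies no longer run one after another. It assumes a SparkSession named spark and a Seq[String] of HDFS paths named sourceFiles; the bucket and prefix are placeholders.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val targetPrefix = "s3://my-bucket/target/"   // placeholder destination

spark.sparkContext
  .parallelize(sourceFiles, 20)               // run up to 20 copy tasks in parallel
  .foreachPartition { paths =>
    // On EMR executors the cluster's Hadoop config is on the classpath, so a
    // fresh Configuration() should resolve both the hdfs:// and s3:// schemes.
    val conf  = new Configuration()
    val dstFs = FileSystem.get(new URI(targetPrefix), conf)
    paths.foreach { p =>
      val src   = new Path(p)
      val srcFs = src.getFileSystem(conf)
      FileUtil.copy(srcFs, src, dstFs, new Path(targetPrefix + src.getName),
                    /* deleteSource = */ false, conf)
    }
  }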
For another, similar requirement, moving files between two folders of the same S3 bucket, we use the com.amazonaws.services.s3 package as below.
import com.amazonaws.services.s3.AmazonS3URI

val uri1 = new AmazonS3URI(sourcePath)
val uri2 = new AmazonS3URI(targetPath)
s3Client.copyObject(uri1.getBucket, uri1.getKey, uri2.getBucket, uri2.getKey)
The above package, however, only has methods to copy or move objects between two S3 locations. My requirement is to copy files between HDFS (on the cluster spun up by EMR) and the root of an S3 bucket. Can anyone suggest a better way, or any AWS S3 API usable from Spark/Scala, for moving files between HDFS and an S3 bucket?
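For context, a minimal sketch of staying within the AWS SDK for an HDFS-to-S3 transfer: open the HDFS file as an InputStream and hand it to the SDK's TransferManager, which performs a multipart upload for large files. This is not a server-side copy, so the bytes are still streamed through the JVM; it simply avoids staging the file locally. The bucket, key, and HDFS path below are placeholders, and a SparkSession named spark is assumed.

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ObjectMetadata
import com.amazonaws.services.s3.transfer.TransferManagerBuilder
import org.apache.hadoop.fs.{FileSystem, Path}

val hdfs       = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val sourceFile = new Path("hdfs:///data/my-file.parquet")   // placeholder path

val s3              = AmazonS3ClientBuilder.defaultClient()
val transferManager = TransferManagerBuilder.standard().withS3Client(s3).build()

val metadata = new ObjectMetadata()
metadata.setContentLength(hdfs.getFileStatus(sourceFile).getLen)

// upload() streams the HDFS bytes to S3; waitForCompletion() blocks until done.
val upload = transferManager.upload(
  "my-bucket", "target/my-file.parquet", hdfs.open(sourceFile), metadata)
upload.waitForCompletion()
transferManager.shutdownNow()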
aws s3 cp ./my-source-file s3://my-bucket will copy a local file to a bucket, so long as the CLI profile you're using has the right permissions. You can just include a shell script that takes in the needed variables and run it from Scala. - bryan60
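A rough sketch of that suggestion, shelling out from Scala with scala.sys.process (the paths and bucket are placeholders, and it assumes the aws CLI is available on the node, as it is on EMR by default). Note that aws s3 cp reads local paths, not hdfs:// URIs, so the file has to be on the node's local file system first.

import scala.sys.process._

// Runs `aws s3 cp <source> <target>` and returns the CLI's exit code.
def s3Cp(source: String, target: String): Int =
  Seq("aws", "s3", "cp", source, target).!

val exitCode = s3Cp("/tmp/my-source-file", "s3://my-bucket/my-source-file")
require(exitCode == 0, s"aws s3 cp failed with exit code $exitCode")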