We have a requirement to copy files, within a Spark job (running on a Hadoop cluster spun up by EMR), to the respective S3 bucket. At the moment we use the Hadoop FileSystem API (FileUtil.copy) to copy or move files between the two different file systems.
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
val config = spark.sparkContext.hadoopConfiguration
FileUtil.copy(sourceFileSystem, sourceFile, destinationFileSystem, targetLocation, true, config) // deleteSource = true, so this is effectively a move
This method works as required, but it is not efficient: it streams each file through the job, so the execution time grows with both the size of the files and the number of files to be copied.
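For reference, here is a hedged sketch (not our actual job) of one way to cut the wall-clock time: distribute the FileUtil.copy calls across the executors so many files are copied concurrently. Each file is still streamed, but the copies no longer run one after another. It assumes a SparkSession named spark and a Seq[String] of HDFS paths named sourceFiles; the bucket and prefix are placeholders.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val targetPrefix = "s3://my-bucket/target/"   // placeholder destination

spark.sparkContext
  .parallelize(sourceFiles, 20)               // run up to 20 copy tasks in parallel
  .foreachPartition { paths =>
    // On EMR executors the cluster's Hadoop config is on the classpath, so a
    // fresh Configuration() should resolve both the hdfs:// and s3:// schemes.
    val conf  = new Configuration()
    val dstFs = FileSystem.get(new URI(targetPrefix), conf)
    paths.foreach { p =>
      val src   = new Path(p)
      val srcFs = src.getFileSystem(conf)
      FileUtil.copy(srcFs, src, dstFs, new Path(targetPrefix + src.getName),
                    /* deleteSource = */ false, conf)
    }
  }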
For another, similar requirement, moving files between two folders of the same S3 bucket, we use the com.amazonaws.services.s3 package as below.
import com.amazonaws.services.s3.AmazonS3URI

val uri1 = new AmazonS3URI(sourcePath)
val uri2 = new AmazonS3URI(targetPath)
s3Client.copyObject(uri1.getBucket, uri1.getKey, uri2.getBucket, uri2.getKey)
The above package, however, only has methods to copy or move objects between two S3 locations. My requirement is to copy files between HDFS (on the cluster spun up by EMR) and the root of an S3 bucket. Can anyone suggest a better way, or any AWS S3 API usable from Spark/Scala, for moving files between HDFS and an S3 bucket?
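For context, a minimal sketch of staying within the AWS SDK for an HDFS-to-S3 transfer: open the HDFS file as an InputStream and hand it to the SDK's TransferManager, which performs a multipart upload for large files. This is not a server-side copy, so the bytes are still streamed through the JVM; it simply avoids staging the file locally. The bucket, key, and HDFS path below are placeholders, and a SparkSession named spark is assumed.

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ObjectMetadata
import com.amazonaws.services.s3.transfer.TransferManagerBuilder
import org.apache.hadoop.fs.{FileSystem, Path}

val hdfs       = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val sourceFile = new Path("hdfs:///data/my-file.parquet")   // placeholder path

val s3              = AmazonS3ClientBuilder.defaultClient()
val transferManager = TransferManagerBuilder.standard().withS3Client(s3).build()

val metadata = new ObjectMetadata()
metadata.setContentLength(hdfs.getFileStatus(sourceFile).getLen)

// upload() streams the HDFS bytes to S3; waitForCompletion() blocks until done.
val upload = transferManager.upload(
  "my-bucket", "target/my-file.parquet", hdfs.open(sourceFile), metadata)
upload.waitForCompletion()
transferManager.shutdownNow()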
aws s3 cp ./my-source-file s3://my-bucket will copy a local file to a bucket, so long as the CLI profile you're using has the right permissions. You can just include a shell script that takes in the needed variables and run it from Scala. - bryan60
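A rough sketch of that suggestion, shelling out from Scala with scala.sys.process (the paths and bucket are placeholders, and it assumes the aws CLI is available on the node, as it is on EMR by default). Note that aws s3 cp reads local paths, not hdfs:// URIs, so the file has to be on the node's local file system first.

import scala.sys.process._

// Runs `aws s3 cp <source> <target>` and returns the CLI's exit code.
def s3Cp(source: String, target: String): Int =
  Seq("aws", "s3", "cp", source, target).!

val exitCode = s3Cp("/tmp/my-source-file", "s3://my-bucket/my-source-file")
require(exitCode == 0, s"aws s3 cp failed with exit code $exitCode")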