I'm working on an AWS EMR cluster and added a step that runs S3DistCp (https://docs.aws.amazon.com/es_es/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html) to copy objects from one S3 bucket to another S3 bucket.
Objects are copied correctly to the destination bucket, and with the --deleteOnSuccess option the copied objects are deleted from the source bucket as expected. The problem is that, for every folder on the source bucket that contained a copied object, a new file is created at the root of the source bucket (this only happens with the --deleteOnSuccess option).
The arguments I use are:
s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=s3://MY_SOURCE_BUCKET/ --dest=s3://MY_DESTINATION_BUCKET/ --srcPrefixesFile=s3://ANOTHER_BUCKET/objects_list.txt --deleteOnSuccess
In this case, if s3://MY_SOURCE_BUCKET/ contains:
s3://MY_SOURCE_BUCKET/
|--folder_a/
| |------ a.txt
| |------ b.txt
| |------ c.txt
|--folder_b/
| |------ d.txt
and I want to copy and delete only s3://MY_SOURCE_BUCKET/folder_a/b.txt, then once the S3DistCp run has completed, the source bucket looks like:
s3://MY_SOURCE_BUCKET/
|--folder_a_$folder$ <-- This is the new file created with `_$folder$` suffix
|--folder_a/
| |------ a.txt
| |------ c.txt
|--folder_b/
| |------ d.txt
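To make the symptom concrete, here is a minimal Python sketch of how the extra objects can be told apart from real data by their `_$folder$` suffix. It assumes the bucket's keys have already been listed by some other means; the function name `find_folder_markers` and the sample key list are hypothetical and only mirror the example above.

```python
# Suffix observed on the extra objects in the source bucket.
FOLDER_MARKER_SUFFIX = "_$folder$"


def find_folder_markers(keys):
    """Return the keys that end with the `_$folder$` suffix."""
    return [key for key in keys if key.endswith(FOLDER_MARKER_SUFFIX)]


# Hypothetical key listing of the source bucket after the run described above.
keys_after_run = [
    "folder_a_$folder$",
    "folder_a/a.txt",
    "folder_a/c.txt",
    "folder_b/d.txt",
]

markers = find_folder_markers(keys_after_run)
print(markers)
```

This only identifies the extra objects; it does not stop S3DistCp from creating them.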
Is there a way to prevent S3DistCp from creating these new files on the source bucket?