We're using Amazon's Elastic MapReduce to perform some large file-processing jobs. As part of our workflow, we occasionally need to remove files from S3 that may already exist. We do so using the hadoop fs interface, like this:
hadoop fs -rmr s3://mybucket/a/b/myfile.log
This removes the file from S3 as intended, but in its place leaves an empty object named "s3://mybucket/a/b_$folder$". As described in this question, Hadoop's Pig is unable to handle these files, so later steps in the workflow can choke on them.
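For reference, here's roughly the sequence that reproduces the behavior (the bucket and paths are the same placeholders as above):

hadoop fs -rmr s3://mybucket/a/b/myfile.log
hadoop fs -ls s3://mybucket/a/

The listing after the delete is where the stray b_$folder$ object shows up.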
(Note: it doesn't seem to matter whether we use -rmr or -rm, or whether we use s3:// or s3n:// as the scheme; all of these exhibit the described behavior.)
How do I use the hadoop fs interface to remove files from S3 and be sure not to leave these troublesome files behind?