2 votes

We're using Amazon's Elastic MapReduce to perform some large file-processing jobs. As part of our workflow, we occasionally need to remove files from S3 that may already exist. We do so using the hadoop fs interface, like this:

hadoop fs -rmr s3://mybucket/a/b/myfile.log

This removes the file from S3 as expected, but in its place it leaves an empty file named "s3://mybucket/a/b_$folder$". As described in this question, Hadoop's Pig is unable to handle these files, so later steps in the workflow can choke on them.

(Note that it doesn't seem to matter whether we use -rmr or -rm, or whether we use s3:// or s3n:// as the scheme: all of these exhibit the described behavior.)

How do I use the hadoop fs interface to remove files from S3 without leaving these troublesome files behind?
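
(For the moment, we could presumably clean up the stray marker by hand after each delete, single-quoting the name so the shell doesn't expand $folder:

hadoop fs -rm 's3://mybucket/a/b_$folder$'

but that's a post-hoc cleanup rather than a fix, and for all we know it just leaves another marker one level up.)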


2 Answers

0 votes

I wasn't able to figure out whether the hadoop fs interface can be used this way. However, the s3cmd interface does the right thing (but only for one key at a time):

s3cmd del s3://mybucket/a/b/myfile.log

This requires a ~/.s3cfg file containing your AWS credentials; running s3cmd --configure will walk you through creating it interactively.
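
If there are several keys to remove, one way to script it (assuming none of the key names contain whitespace, and that s3cmd ls prints the key URI in its fourth column, which it does in the versions I've used) is to pipe a listing into s3cmd del:

# delete every key under the prefix, one s3cmd call per key
s3cmd ls s3://mybucket/a/b/ | awk '{print $4}' | xargs -n 1 s3cmd del

Newer s3cmd releases may also support a recursive delete, which would make this a single command, but I haven't verified that.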

0 votes