6
votes

I'm trying to run a job on Elastic MapReduce (EMR) with a custom jar, processing about 1000 files in a single directory. When I submit my job with the parameter s3n://bucketname/compressed/*.xml.gz, I get a "matched 0 files" error. If I pass the absolute path to a single file (e.g. s3n://bucketname/compressed/00001.xml.gz), it runs fine, but only that one file gets processed. I also tried passing the directory name (s3n://bucketname/compressed/), hoping the files within it would be processed, but that just passes the directory itself to the job.

At the same time, I have a smaller local Hadoop installation. There, when I submit the same job with a wildcard (/path/to/dir/on/hdfs/*.xml.gz), it works fine and all 1000 files are listed correctly.

How do I get EMR to list all my files?

1
Alternately, how do I list the files within a directory in s3 from the code? I can then generate the paths from those files. - Shashank Agarwal
It works now! There was an empty file called compressed in the same bucket. As soon as I deleted the empty file, the program started working. - Shashank Agarwal

1 Answer

2
votes

I don't know how EMR expands the file list, but here's a piece of code which works for me:

        // Get a handle to the filesystem (S3 or HDFS) that the input URI points to
        FileSystem fs = FileSystem.get(URI.create(args[0]), job.getConfiguration());
        // List every file in the input directory and add each one as an input path
        FileStatus[] files = fs.listStatus(new Path(args[0]));
        for (FileStatus sfs : files) {
            FileInputFormat.addInputPath(job, sfs.getPath());
        }

It will list all the files in the input directory, and you can then do with them whatever you want.
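If you want to keep the wildcard syntax rather than pulling in a whole directory, FileSystem also has a globStatus method that expands a glob pattern itself. A minimal sketch along those lines (the helper name and the pattern argument are my own, not from your setup; note globStatus can return null when nothing matches, which would reproduce your "matched 0 files" symptom):

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class GlobInputs {

    // Expand a glob pattern (e.g. "s3n://bucketname/compressed/*.xml.gz")
    // and add every matching file as a job input path.
    public static void addGlobbedInputs(Job job, String pattern) throws IOException {
        FileSystem fs = FileSystem.get(URI.create(pattern), job.getConfiguration());
        FileStatus[] matches = fs.globStatus(new Path(pattern));
        if (matches == null || matches.length == 0) {
            // Fail loudly instead of silently running a job with no inputs
            throw new IOException("Pattern matched 0 files: " + pattern);
        }
        for (FileStatus status : matches) {
            FileInputFormat.addInputPath(job, status.getPath());
        }
    }
}
```

You would call it as e.g. addGlobbedInputs(job, "s3n://bucketname/compressed/*.xml.gz") in place of the usual FileInputFormat.addInputPath call.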