I've been running several MapReduce jobs on a Hadoop cluster from a single JAR file. The JAR's main class accepts an XML file as a command-line parameter. The XML file contains the input and output paths for each job (name-value property pairs), and I use these to configure each MapReduce job. I'm able to load the paths into the Configuration like so:
Configuration config = new Configuration(false);
config.addResource(new FileInputStream(args[0]));
I am now trying to run the JAR using Amazon's Elastic MapReduce. I tried uploading the XML file to S3, but of course using FileInputStream to load the path data from S3 doesn't work (FileNotFoundException).
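Is going through Hadoop's FileSystem API (so that s3:// / s3n:// URIs get resolved) the right direction? Something like the sketch below is what I had in mind, though I don't know if it's the recommended approach on EMR (the class name JobConfigLoader and the bucket path in the comment are just placeholders):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class JobConfigLoader {
    // Resolve the URI through Hadoop's FileSystem abstraction instead of java.io,
    // so a local path, an HDFS path, or an S3 URI (e.g. s3n://my-bucket/jobs.xml) should all work.
    public static Configuration loadJobConfig(String uri) throws IOException {
        Path xmlPath = new Path(uri);
        FileSystem fs = xmlPath.getFileSystem(new Configuration());
        Configuration config = new Configuration(false);
        config.addResource(fs.open(xmlPath)); // addResource(InputStream), same as before
        return config;
    }
}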
How can I pass the XML file to the JAR when using EMR?
(I looked at bootstrap actions, but as far as I can tell they're for specifying Hadoop-specific configuration.)
Any insight would be appreciated. Thanks.