
TL;DR

How can I upload or specify additional JARs for a Hadoop Streaming job on Amazon Elastic MapReduce (Amazon EMR)?

Long version

I want to analyze a set of Avro files (> 2000 files) using Hadoop on Amazon Elastic MapReduce (Amazon EMR). It should be a simple exercise through which I can gain some confidence with MapReduce and Amazon EMR (I am new to both).

Since Python is my favorite language, I decided to use Hadoop Streaming. I built a simple mapper and reducer in Python and tested them on a local Hadoop installation (single-node). The command I was issuing on my local install was this:

$HADOOP_PREFIX/bin/hadoop jar $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming-2.4.0-amzn-1.jar \
                  -files avro-1.7.7.jar,avro-mapred-1.7.7.jar \
                  -libjars avro-1.7.7.jar,avro-mapred-1.7.7.jar \
                  -input "input" \
                  -mapper "python2.7 $PWD/mapper.py"  \
                  -reducer "python2.7 $PWD/reducer.py" \
                  -output "output/outdir" \
                  -inputformat org.apache.avro.mapred.AvroAsTextInputFormat

and the job completed successfully.

I have a bucket on Amazon S3 with a folder containing all the input files and another folder with the mapper and reducer scripts (mapper.py and reducer.py respectively).

Using the web interface I created a small cluster, added a bootstrap action to install all the required Python modules on each node, and then added a "Hadoop Streaming" step specifying the locations of the mapper and reducer scripts on S3.
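
For reference, a bootstrap action is simply a shell script stored on S3 that EMR runs on every node as it starts up. A minimal sketch of the kind of script I mean (the file name and module list are only examples; install whatever your mapper and reducer actually import):

#!/bin/bash
# install_python_modules.sh -- example bootstrap action stored on S3.
# EMR runs this on every node when the cluster starts.
set -e
# pip may be missing on older AMIs; install it first if needed.
which pip > /dev/null || sudo easy_install pip
# Example only: install the Python modules the streaming scripts import.
sudo pip install avro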

The problem is that I have no idea how to upload, or specify in the options, the two JARs required to run this job: avro-1.7.7.jar and avro-mapred-1.7.7.jar.

I have tried several things:

  • using the -files flag in combination with -libjars in the optional arguments;
  • adding another bootstrap action that downloads the JARs to every node (and I have tried downloading them to different locations on the nodes);
  • uploading the JARs to my bucket and specifying a full s3://... path as an argument to -libjars in the options (note: these files are actively ignored by Hadoop, and a warning is issued).

If I don't pass the two JARs the job fails (it does not recognize the -inputformat class), and I have tried every possibility (and combination thereof!) I could think of, to no avail.


1 Answer


In the end, I figured it out (and it was, of course, obvious).

Here's how I did it:

  1. add a bootstrap action that downloads the JARs to every node; for example, you can upload the JARs to your bucket, make them public, and then do (a complete script sketch follows this list):

    wget https://yourbucket/path/somejar.jar -O $HOME/somejar.jar
    wget https://yourbucket/path/avro-1.7.7.jar -O $HOME/avro-1.7.7.jar
    wget https://yourbucket/path/avro-mapred-1.7.7.jar -O $HOME/avro-mapred-1.7.7.jar
    
  2. when you specify -libjars in the optional arguments, use the absolute paths, like so:

    -libjars /home/hadoop/somejar.jar,/home/hadoop/avro-1.7.7.jar,/home/hadoop/avro-mapred-1.7.7.jar
    

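For completeness, here is roughly what the bootstrap script from step 1 looks like as a single file that you upload to S3 and register as a bootstrap action (the script name is arbitrary and the bucket path is the same placeholder as above):

#!/bin/bash
# download_jars.sh -- example bootstrap action stored on S3.
# Downloads the Avro JARs to /home/hadoop on every node so that
# -libjars can reference them with absolute paths.
set -e
wget https://yourbucket/path/avro-1.7.7.jar -O /home/hadoop/avro-1.7.7.jar
wget https://yourbucket/path/avro-mapred-1.7.7.jar -O /home/hadoop/avro-mapred-1.7.7.jar
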
I have lost a number of hours on this that I am ashamed to admit; I hope this helps somebody else.

Edit (Feb 10th, 2015)

I have double-checked, and I want to point out that environment variables do not seem to be expanded when they are passed in the optional arguments field. So, write the $HOME path out explicitly (i.e. /home/hadoop).

Edit (Feb 11th, 2015)

If you want to launch a streaming job on Amazon EMR using the AWS CLI, you can use the following command.

aws emr create-cluster  --ami-version '3.3.2' \
                        --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType='m1.medium' InstanceGroupType=CORE,InstanceCount=2,InstanceType='m3.xlarge' \
                        --name 'TestStreamingJob' \
                        --no-auto-terminate \
                        --log-uri 's3://path/to/your/bucket/logs/' \
                        --no-termination-protected \
                        --enable-debugging \
                        --bootstrap-actions Path='s3://path/to/your/bucket/script.sh',Name='ExampleBootstrapScript' Path='s3://path/to/your/bucket/another_script.sh',Name='AnotherExample' \
                        --steps file://./steps_test.json

and you can specify the steps in a JSON file:

[
 {
  "Name": "Avro",
  "Args": ["-files","s3://path/to/your/mapper.py,s3://path/to/your/reducer.py","-libjars","/home/hadoop/avro-1.7.7.jar,/home/hadoop/avro-mapred-1.7.7.jar","-inputformat","org.apache.avro.mapred.AvroAsTextInputFormat","-mapper","mapper.py","-reducer","reducer.py","-input","s3://path/to/your/input_directory/","-output","s3://path/to/your/output_directory/"],
  "ActionOnFailure": "CONTINUE",
  "Type": "STREAMING"
 }
]

(please note that the official Amazon documentation is somewhat outdated; in fact it uses the old Amazon EMR CLI tool, which has been deprecated in favor of the more recent AWS CLI)
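
Since the cluster above is launched with --no-auto-terminate, you can also submit the same streaming step to the cluster after it is already running. A sketch, assuming the cluster id printed by create-cluster is j-XXXXXXXXXXXXX:

# Add the step defined in steps_test.json to a running cluster;
# replace j-XXXXXXXXXXXXX with the ClusterId returned by create-cluster.
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
                  --steps file://./steps_test.json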