TL;DR
How can I upload or specify additional JARs for a Hadoop Streaming job on Amazon Elastic MapReduce (Amazon EMR)?
Long version
I want to analyze a set of Avro files (> 2000 files) using Hadoop on Amazon Elastic MapReduce (Amazon EMR). This is meant to be a simple exercise to gain some confidence with MapReduce and Amazon EMR (I am new to both).
Since Python is my favorite language, I decided to use Hadoop Streaming. I built a simple mapper and reducer in Python and tested them on a local Hadoop installation (single-node setup). The command I was issuing on my local install was this:
$HADOOP_PREFIX/bin/hadoop jar $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming-2.4.0-amzn-1.jar \
-files avro-1.7.7.jar,avro-mapred-1.7.7.jar \
-libjars avro-1.7.7.jar,avro-mapred-1.7.7.jar \
-input "input" \
-mapper "python2.7 $PWD/mapper.py" \
-reducer "python2.7 $PWD/reducer.py" \
-output "output/outdir" \
-inputformat org.apache.avro.mapred.AvroAsTextInputFormat
and the job completed successfully.
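For context, the mapper and reducer are along these lines. This is a simplified sketch, not the actual scripts: the field name "type" and the counting logic are placeholders, since with AvroAsTextInputFormat each Avro record reaches the mapper as one JSON-encoded line on stdin.

```python
#!/usr/bin/env python2.7
# Simplified sketch of mapper.py / reducer.py. With AvroAsTextInputFormat,
# each Avro record arrives on the mapper's stdin as a single JSON line.
# The field name "type" is a placeholder; the real Avro schema is not shown here.
import json
import sys

def map_record(line):
    """Turn one JSON-encoded Avro record into a tab-separated (key, count) pair."""
    record = json.loads(line)
    return "%s\t1" % record.get("type", "unknown")

def reduce_pairs(lines):
    """Sum counts per key; Hadoop Streaming delivers reducer input sorted by key."""
    totals = {}
    for line in lines:
        key, count = line.rstrip("\n").split("\t")
        totals[key] = totals.get(key, 0) + int(count)
    return totals

if __name__ == "__main__":
    # Mapper entry point; reducer.py would instead print reduce_pairs(sys.stdin).
    for line in sys.stdin:
        print(map_record(line))
```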
I have a bucket on Amazon S3 with one folder containing all the input files and another folder with the mapper and reducer scripts (mapper.py and reducer.py respectively).
Using the interface I created a small cluster, added a bootstrap action to install all the required Python modules on each node, and then added a "Hadoop Streaming" step specifying the S3 locations of the mapper and reducer scripts.
The problem is that I don't have the slightest idea how to upload, or specify in the options, the two JARs - avro-1.7.7.jar and avro-mapred-1.7.7.jar - required to run this job.
I have tried several things:
- using the -files flag in combination with -libjars in the optional arguments;
- adding another bootstrap action that downloads the JARs to every node (and I have tried downloading them to different locations on the nodes);
- uploading the JARs to my bucket and specifying full s3://... paths as arguments to -libjars in the options (note: these files are actively ignored by Hadoop, and a warning is issued).
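For concreteness, that last attempt passed optional arguments of roughly this shape to the step (the bucket name and jars/ prefix are hypothetical placeholders, not my real paths):

```
# "mybucket" and "jars/" are hypothetical placeholders
-files s3://mybucket/jars/avro-1.7.7.jar,s3://mybucket/jars/avro-mapred-1.7.7.jar \
-libjars s3://mybucket/jars/avro-1.7.7.jar,s3://mybucket/jars/avro-mapred-1.7.7.jar \
-inputformat org.apache.avro.mapred.AvroAsTextInputFormat
```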
If I don't pass the two JARs the job fails (it does not recognize the -inputformat class), but I have tried every possibility (and combination thereof!) I could think of, to no avail.