Hadoop: Input and Output paths in AWS EMR job

Question

I am trying to run a Hadoop job on Amazon Elastic MapReduce. My data and JAR are stored in Amazon S3. When I set up the job flow, I pass the following JAR arguments:

s3n://my-hadoop/input s3n://my-hadoop/output

Below is my Hadoop main function:

public static void main(String[] args) throws Exception
{
    Configuration conf = new Configuration();
    Job job = new Job(conf, "MyMR");
    job.setJarByClass(MyMR.class);
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(CountryReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

However, my job flow fails with the following log in stderr:

Exception in thread "main" java.lang.ClassNotFoundException: s3n://my-hadoop/input
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:180)

So how do I specify my input and output paths in AWS EMR?

Amar Amar · Accepted Answer · 2013-02-14T18:51:22

This is the classic error of not defining the main class when building an executable JAR. When the JAR's manifest does not name a main class, Hadoop takes the first argument to be the main class name. That is why it tries to load s3n://my-hadoop/input as a class and fails with ClassNotFoundException.
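The selection rule described above can be sketched in plain Java. This is a simplified illustration, not Hadoop's actual RunJar source: the class MainClassResolution, its nested Resolved type, and both helper methods are hypothetical names made up for this example. It only assumes that the launcher prefers the manifest's Main-Class and otherwise consumes the first argument as the class name, forwarding the rest to main.

```java
import java.util.Arrays;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical illustration class; this is not Hadoop's RunJar source.
final class MainClassResolution {

    /** The class that would be launched and the arguments its main() would receive. */
    static final class Resolved {
        final String mainClass;
        final String[] programArgs;

        Resolved(String mainClass, String[] programArgs) {
            this.mainClass = mainClass;
            this.programArgs = programArgs;
        }
    }

    /** Builds an in-memory manifest that declares the given Main-Class. */
    static Manifest manifestWithMainClass(String mainClass) {
        Manifest manifest = new Manifest();
        Attributes attrs = manifest.getMainAttributes();
        attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        attrs.put(Attributes.Name.MAIN_CLASS, mainClass);
        return manifest;
    }

    /**
     * @param manifest the jar's manifest, or null if the jar has none
     * @param stepArgs the arguments that follow the jar in the step definition
     */
    static Resolved resolve(Manifest manifest, String[] stepArgs) {
        String mainClass = null;
        if (manifest != null) {
            mainClass = manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        }
        int firstArg = 0;
        if (mainClass == null) {
            // No Main-Class in the manifest: the first argument is consumed as the class name.
            if (stepArgs.length == 0) {
                throw new IllegalArgumentException("no Main-Class in manifest and no arguments");
            }
            mainClass = stepArgs[firstArg++];
        }
        // Whatever is left is what the job's main(String[] args) receives.
        String[] programArgs = Arrays.copyOfRange(stepArgs, firstArg, stepArgs.length);
        return new Resolved(mainClass, programArgs);
    }

    public static void main(String[] args) {
        String[] stepArgs = {"s3n://my-hadoop/input", "s3n://my-hadoop/output"};

        Resolved broken = resolve(null, stepArgs);
        System.out.println("No Main-Class: tries to load " + broken.mainClass);

        Resolved fixed = resolve(manifestWithMainClass("MyMR"), stepArgs);
        System.out.println("With Main-Class: runs " + fixed.mainClass
                + " with " + Arrays.toString(fixed.programArgs));
    }
}
```

With no Main-Class, the input path is the value that ends up being passed to Class.forName, which matches the stack trace in the question.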

So make sure that, when you build the executable JAR, you specify the Main-Class in its manifest.
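Before uploading the JAR to S3, it may help to confirm that the manifest really declares a Main-Class. The following is a minimal sketch using only java.util.jar from the JDK; ManifestCheck, mainClassOf and writeDemoJar are hypothetical names for this example, and writeDemoJar exists only to produce a throwaway JAR to check against.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

// Hypothetical helper; not part of Hadoop or the EMR tooling.
final class ManifestCheck {

    /** Returns the Main-Class declared in the jar's manifest, or null if there is none. */
    static String mainClassOf(File jar) {
        try (JarFile jarFile = new JarFile(jar)) {
            Manifest manifest = jarFile.getManifest();
            if (manifest == null) {
                return null;
            }
            return manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a temporary jar holding only a manifest; mainClass may be null to omit Main-Class. */
    static File writeDemoJar(String mainClass) {
        Manifest manifest = new Manifest();
        Attributes attrs = manifest.getMainAttributes();
        attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        if (mainClass != null) {
            attrs.put(Attributes.Name.MAIN_CLASS, mainClass);
        }
        try {
            File jar = File.createTempFile("manifest-check", ".jar");
            jar.deleteOnExit();
            try (JarOutputStream out = new JarOutputStream(new FileOutputStream(jar), manifest)) {
                // The constructor writes the manifest entry; no class files are needed for the demo.
            }
            return jar;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: ManifestCheck <path-to-jar>");
            System.exit(2);
        }
        String mainClass = mainClassOf(new File(args[0]));
        System.out.println(mainClass == null
                ? "No Main-Class in manifest: the first step argument will be used as the class name"
                : "Main-Class: " + mainClass);
    }
}
```

Running it against the job JAR before creating the job flow shows which of the two situations applies.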

OR

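Pass the fully qualified main class name as the first argument of the step, ahead of the input and output paths. RunJar consumes that first argument as the class name and forwards only the remaining arguments to main, so args[0] and args[1] still hold the input and output paths. Run the Hadoop step with something like the following: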

ruby elastic-mapreduce -j $jobflow --jar s3://my-jar-location/myjar.jar --arg com.somecompany.MyMainClass --arg s3n://my-hadoop/input --arg s3n://my-hadoop/output
