EMR Streaming job using Java code for mapper and reducer

Question

I currently have streaming jobs run with mapper and reducer code written in ruby. I want to convert these to java. I do not know how to run a streaming job with EMR hadoop using java. The sample given in amazon's EMR website of cloudburst is too complex. Following are the details of how I run the jobs currently.

Code to start a job:

        elastic-mapreduce --create --alive --plain-output --master-instance-type m1.small 
--slave-instance-type m1.xlarge --num-instances 2  --name "Job Name" --bootstrap-action 
    s3://bucket-path/bootstrap.sh

Code to add a step:

    elastic-mapreduce -j <job_id> --stream --step-name "my_step_name" 
--jobconf mapred.task.timeout=0 --mapper s3://bucket-path/mapper.rb 
--reducer s3://bucket-path/reducerRules.rb --cache s3://bucket-path/cache/cache.txt 
--input s3://bucket-path/input --output s3://bucket-path/output

Mapper code reads from a csv file which is mentioned above as EMR's cache argument as well as it reads from the input s3 bucket which also has some csv files, does some calculations and prints a csv output lines to standard output.

//mapper.rb 
CSV_OPTIONS  = {
  // some CSV options
}

begin
    file = File.open("cache.txt")
    while (line = file.gets)
        // do something
    end
    file.close
end

input  = FasterCSV.new(STDIN, CSV_OPTIONS)
input.each{ 
// do calculations and get result
puts (result)
}

//reducer.rb

$stdin.each_line do |line|
// do some aggregations and get aggregation_result
if(some_condition) puts(aggregation_result)
end

Matthew Rathbone Matthew Rathbone · Accepted Answer · 2012-04-23T22:30:04

You don't use streaming if you're using java. You build a Jar directly against the MapReduce API.

Check out the examples folder of the hadoop source for some good examples of how to do this, including the infamous wordcount: https://github.com/apache/hadoop/tree/trunk/src/examples/org/apache/hadoop/examples

I'm not totally sure why you want to use Java, but coding directly to the API's can be painful. You might want to try one of the following: Java Projects:

Cascading http://www.cascading.org/
Crunch http://www.cloudera.com/blog/2011/10/introducing-crunch/

Non-Java:

Hive (sql-like) https://cwiki.apache.org/confluence/display/Hive/Home
Pig http://pig.apache.org/#Getting+Started
Scoobi (scala) https://github.com/NICTA/scoobi

FWIW I think Pig would probably be my pick, and is supported out of the box on EMR.

EMR Streaming job using Java code for mapper and reducer

2 Answers