
I have written a MapReduce application for Hadoop and tested it at the command line on a single machine. My application uses two steps: Map1 -> Reduce1 -> Map2 -> Reduce2. To run this job on Amazon Elastic MapReduce, I am following this link: http://aws.amazon.com/articles/2294. But I am not clear on how to use the Ruby CLI client provided by Amazon to do all the work described. Please guide me.

Thanks.

The description of the question doesn't exactly match the title. I tried to answer the question as it's described. Perhaps you want to expand on the JSON part? – Ronen Botzer

1 Answer


You start by creating the default streaming jobflow (which runs the wordcount example). At that point you use the jobflow ID to add your other steps. In my example, the first MapReduce job stores its results in an S3 bucket, and that output then becomes the input for the second job. If you go into the AWS console you'll see these steps under the Steps tab.
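For context, with Hadoop Streaming the mapper and reducer referenced in each step are plain executables stored in S3 that read lines from stdin and write tab-separated key/value pairs to stdout. A minimal word-count-style sketch of what such a map.rb and reduce.rb might look like (hypothetical; substitute your own logic):

#!/usr/bin/env ruby
# map.rb -- emits one "word<TAB>1" line per token on stdout
STDIN.each_line do |line|
  line.split.each { |word| puts "#{word}\t1" }
end

#!/usr/bin/env ruby
# reduce.rb -- input arrives sorted by key, so counts can be summed per key
current, count = nil, 0
STDIN.each_line do |line|
  key, value = line.chomp.split("\t")
  if key == current
    count += value.to_i
  else
    puts "#{current}\t#{count}" if current
    current, count = key, value.to_i
  end
end
puts "#{current}\t#{count}" if current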

You can keep chaining jobs in this way, since the --alive flag makes sure the cluster doesn't shut down until you manually terminate it. Just remember to do so once the last step has completed and the jobflow has returned to the WAITING state, otherwise you'll get charged for the idle time (see the terminate sketch after the transcript below).

$ elastic-mapreduce --create --alive --stream --num-instances=1 --master-instance-type=m1.small

Created job flow j-NXXXJARJARSXXX
$ elastic-mapreduce -j j-NXXXJARJARSXXX --stream \
 --input   s3n://mybucket.data/2011/01/01/* \
 --output  s3n://mybucket.joblog/step1done-2011-01-01 \
 --mapper  s3n://mybucket.code/map.rb \
 --reducer s3n://mybucket.code/reduce.rb

Added jobflow steps
$ elastic-mapreduce -j j-NXXXJARJARSXXX --stream \
 --input   s3n://mybucket.joblog/step1done-2011-01-01/part-*  \
 --output  s3n://mybucket.joblog/job-results \
 --mapper  s3n://mybucket.code/map.rb \
 --reducer s3n://mybucket.code/reduce.rb

Added jobflow steps
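
Once that last step finishes and the jobflow is back in WAITING, you can check its state and then shut the cluster down. A short sketch, assuming the same jobflow ID as above:

$ elastic-mapreduce -j j-NXXXJARJARSXXX --list
$ elastic-mapreduce -j j-NXXXJARJARSXXX --terminate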