0
votes

How do I require external libraries when running Amazon EMR streaming jobs written in Ruby?

I've defined my mapper, and am getting this output in my logs:

/mnt/var/lib/hadoop/mapred/taskTracker/jobcache/job_201008110139_0001/attempt_201008110139_0001_m_000000_0/work/./mapper_stage1.rb: line 1: require: command not found

My first reaction is that either the streaming JAR isn't realizing that it's executing a Ruby script (I've got a shebang declaration at the top of the script pointing to /usr/bin/ruby), or that there's something funky going on with how the streaming API handles references to external libraries.

1
Looks like it's not being executed by Ruby. You could try adding something like puts RUBY_VERSION at the top... - rogerdpack
That's precisely what the issue was -- it was executing my Ruby script through sh. Solved that particular issue by explicitly declaring a Ruby interpreter when firing up the job from the command-line tool (i.e.: --mapper 'ruby s3://mybucket/mymapper.rb'). Will update when I actually get it running successfully -- facing a couple of other issues at present. Thanks for the pointer though! - isparling
If you use #!/usr/bin/env ruby the script will execute using the first ruby interpreter found on the PATH. - Ronen Botzer
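Putting the comments above together, a minimal sketch of a streaming mapper with a portable shebang and the suggested version check might look like this (the word-count logic is a hypothetical placeholder, not the asker's actual mapper; Hadoop streaming only requires reading lines from stdin and writing tab-separated key/value pairs to stdout):

```ruby
#!/usr/bin/env ruby
# Portable shebang: uses the first ruby found on the PATH, as Ronen suggests.

# Debug aid from rogerdpack's comment: goes to stderr so it doesn't
# pollute the mapper's key/value output stream.
warn "mapper running under Ruby #{RUBY_VERSION}"

# Hypothetical example logic: emit each word with a count of 1,
# tab-separated, per the Hadoop streaming key<TAB>value convention.
def map_line(line)
  line.split.map { |word| "#{word}\t1" }
end

if __FILE__ == $PROGRAM_NAME
  STDIN.each_line do |line|
    map_line(line).each { |pair| puts pair }
  end
end
```

If this script were being run through sh instead of Ruby, the `require`-style error in the question is exactly what you'd see, since sh has no `require` builtin.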

1 Answer

0
votes

Currently in Amazon Elastic MapReduce, /usr/bin/ruby is a symbolic link pointing to /usr/bin/ruby1.8. This is a dangerous interpreter to use, as it is ancient and buggy.

$ /usr/bin/ruby -v
ruby 1.8.7 (2008-08-11 patchlevel 72) [x86_64-linux]

If you're using one of the 64-bit instances (like m1.xlarge) you can install Ruby Enterprise Edition in a bootstrap action. It installs into /usr/local/bin, which has higher PATH resolution precedence than the stock Ruby 1.8, so service-nanny (whose shebang points at /usr/bin/ruby) still works, while your scripts run on an interpreter built in 2011 with a much higher patchlevel.
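As a rough sketch of how that fits together with the elastic-mapreduce CLI mentioned in the comments: the bucket names, script names, and the installer script itself below are placeholders you'd replace with your own, not real locations.

```shell
# Hypothetical invocation: run a bootstrap action that installs
# Ruby Enterprise Edition before the streaming step starts.
# s3://mybucket/... paths and install-ree.sh are placeholders.
elastic-mapreduce --create --alive \
  --bootstrap-action "s3://mybucket/bootstrap/install-ree.sh" \
  --stream \
  --mapper "s3://mybucket/mapper_stage1.rb" \
  --reducer "s3://mybucket/reducer_stage1.rb" \
  --input "s3://mybucket/input/" \
  --output "s3://mybucket/output/"
```

Because the bootstrap action puts the new interpreter in /usr/local/bin, a mapper whose shebang is `#!/usr/bin/env ruby` will pick it up automatically, while anything hard-coded to /usr/bin/ruby (like service-nanny) keeps using the stock 1.8.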