0
votes

How do I require external libraries when running Amazon EMR streaming jobs written in Ruby?

I've defined my mapper, and am getting this output in my logs:

/mnt/var/lib/hadoop/mapred/taskTracker/jobcache/job_201008110139_0001/attempt_201008110139_0001_m_000000_0/work/./mapper_stage1.rb: line 1: require: command not found

My first reaction is that either the streaming JAR isn't realizing that it's executing a Ruby script (I've got a shebang declaration at the top of the script pointing to /usr/bin/ruby), or that there's something funky going on with how the streaming API handles references to external libraries.

1
Looks like it's not being executed by Ruby. You could try adding something like puts RUBY_VERSION at the top... - rogerdpack
That's precisely what the issue was -- it was executing my Ruby script through sh. Solved that particular issue by explicitly declaring a Ruby interpreter when firing up the job from the command-line tool (i.e.: --mapper 'ruby s3://mybucket/mymapper.rb'). Will update when I actually get it running successfully -- facing a couple of other issues at present. Thanks for the pointer though! - isparling
If you use #!/usr/bin/env ruby the script will execute using the first ruby interpreter found on the PATH. - Ronen Botzer
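Putting the comments above together, a minimal sketch of a streaming mapper with a portable shebang and the suggested version check might look like this (the word-count logic is a hypothetical placeholder, not the asker's actual mapper; Hadoop streaming only requires reading lines from stdin and writing tab-separated key/value pairs to stdout):

```ruby
#!/usr/bin/env ruby
# Portable shebang: uses the first ruby found on the PATH, as Ronen suggests.

# Debug aid from rogerdpack's comment: goes to stderr so it doesn't
# pollute the mapper's key/value output stream.
warn "mapper running under Ruby #{RUBY_VERSION}"

# Hypothetical example logic: emit each word with a count of 1,
# tab-separated, per the Hadoop streaming key<TAB>value convention.
def map_line(line)
  line.split.map { |word| "#{word}\t1" }
end

if __FILE__ == $PROGRAM_NAME
  STDIN.each_line do |line|
    map_line(line).each { |pair| puts pair }
  end
end
```

If this script were being run through sh instead of Ruby, the `require`-style error in the question is exactly what you'd see, since sh has no `require` builtin.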

1 Answer

0
votes

Currently in Amazon Elastic MapReduce, /usr/bin/ruby is a symbolic link pointing to /usr/bin/ruby1.8. This is a dangerous interpreter to use, as it is ancient and buggy.

$ /usr/bin/ruby -v
ruby 1.8.7 (2008-08-11 patchlevel 72) [x86_64-linux]

If you're using one of the 64-bit instances (like m1.xlarge) you can install Ruby Enterprise Edition in a bootstrap action. It installs into /usr/local/bin, which has higher PATH resolution precedence than the stock Ruby 1.8, so service-nanny (whose shebang points at /usr/bin/ruby) still works, while your scripts run on an interpreter built in 2011 with a much higher patchlevel.
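As a rough sketch of how that fits together with the elastic-mapreduce CLI mentioned in the comments: the bucket names, script names, and the installer script itself below are placeholders you'd replace with your own, not real locations.

```shell
# Hypothetical invocation: run a bootstrap action that installs
# Ruby Enterprise Edition before the streaming step starts.
# s3://mybucket/... paths and install-ree.sh are placeholders.
elastic-mapreduce --create --alive \
  --bootstrap-action "s3://mybucket/bootstrap/install-ree.sh" \
  --stream \
  --mapper "s3://mybucket/mapper_stage1.rb" \
  --reducer "s3://mybucket/reducer_stage1.rb" \
  --input "s3://mybucket/input/" \
  --output "s3://mybucket/output/"
```

Because the bootstrap action puts the new interpreter in /usr/local/bin, a mapper whose shebang is `#!/usr/bin/env ruby` will pick it up automatically, while anything hard-coded to /usr/bin/ruby (like service-nanny) keeps using the stock 1.8.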