0 votes

I am new to Hadoop and I am trying to figure out a way to do the following:

  1. I have multiple input image files.
  2. I have binary executables that process these files.
  3. These binary executables write text files as output.
  4. I have a folder that contains all of these executables.
  5. I have a script which runs all of these executables in a certain order, passing the image locations as arguments.

My question is this: can I use Hadoop Streaming to process these images via these binaries and spit out the results from the text files?

I am currently trying this.

I have my Hadoop cluster running. I uploaded my binaries and my images onto HDFS.

I have set up a script which, when Hadoop runs it, should change directory into the image folder and execute another script which runs the binaries.

The script then spits out the results via stdout.

However, I can't figure out how to have my map script change into the image folder on HDFS and then execute the other script.

Can someone give me a hint?

    sudo ./hadoop/bin/hadoop jar ../hduser/hadoop/contrib/streaming/hadoop-streaming-1.1.0.jar \
        -numReduceTasks 0 \
        -file /home/hduser/RunHadoopJob.sh \
        -input /user/hduser/7posLarge \
        -output /user/hduser/output5 \
        -mapper RunHadoopJob.sh \
        -verbose

And my RunHadoopJob.sh:

#!/bin/bash
cd /user/hduser/7posLarge/;
/user/hduser/RunSFM/RunSFM.sh;

My HDFS looks like this:

hadoop fs -ls
Warning: $HADOOP_HOME is deprecated.

Found 4 items
drwxr-xr-x   - hduser supergroup          0 2012-11-28 17:32 /user/hduser/7posLarge
drwxr-xr-x   - hduser supergroup          0 2012-11-28 17:39 /user/hduser/RunSFM
drwxr-xr-x   - root   supergroup          0 2012-11-30 14:32 /user/hduser/output5

I know this is not the standard use of MapReduce. I am simply looking for a way to spin up multiple jobs of the same program with different inputs across the cluster, without writing much overhead. It seems like this is possible, looking at the Hadoop Streaming documentation:

"How do I use Hadoop Streaming to run an arbitrary set of (semi-)independent tasks?

Often you do not need the full power of Map Reduce, but only need to run multiple instances of the same program - either on different parts of the data, or on the same data, but with different parameters. You can use Hadoop Streaming to do this."

If this is not possible, is there another tool, on Amazon AWS for example, that can do this for me?

UPDATE: Looks like there are similar solutions, but I have trouble following them. They are here and here.

2 Answers

0 votes

There are several issues when dealing with Hadoop streaming and binary files:

  • Hadoop doesn't itself know how to process image files
  • mappers read their input from stdin line by line, so you need to create an intermediate shell script that writes the image data from stdin to a temporary file, which is then passed to the executable (see the sketch after this list).
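
A minimal sketch of such a wrapper mapper, assuming a hypothetical binary ./MyImageTool that takes a file path as its only argument and writes its text results to stdout:

#!/bin/bash
# Dump whatever arrives on stdin into a temporary file, then hand that file to
# the binary; whatever the binary prints to stdout becomes the mapper output.
TMP=$(mktemp image_input.XXXXXX)
cat > "$TMP"
./MyImageTool "$TMP"
rm -f "$TMP"

The wrapper itself would be shipped with -file and named as the -mapper, just like RunHadoopJob.sh in the question.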

Just passing the directory location to the executables is not really efficient, since in that case you lose data locality. I don't want to repeat the already well-answered questions on this topic, so here are the links:
Using Amazon MapReduce/Hadoop for Image Processing
Hadoop: how to access (many) photo images to be processed by map/reduce?

Another approach would be to transform the image files into splittable SequenceFiles, i.e. each record would be one image in the SequenceFile. Using this as the input format, the mappers would call the executables on each record they get. Note that you have to provide the executables to the TaskTracker nodes beforehand, with the correct file permissions, so that they can be invoked from Java code (a sketch of shipping them with the job follows below).
Some more information on this topic:
Hadoop: Example process to generating a SequenceFile with image binaries to be processed in map/reduce
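
A rough, untested sketch of shipping the executable to the worker nodes together with the streaming job; the SequenceFile name images.seq, the output path, and the mapper script image_mapper.sh are assumptions, and the generic -files option has to come before the streaming-specific options:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.1.0.jar \
    -files hdfs://master:54310/user/hduser/RunSFM/RunSFM.sh#RunSFM.sh \
    -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat \
    -input /user/hduser/images.seq \
    -output /user/hduser/seq_output \
    -file /home/hduser/image_mapper.sh \
    -mapper image_mapper.sh \
    -numReduceTasks 0

If the executable bit is not preserved on the node, the mapper script can run chmod +x ./RunSFM.sh before invoking it.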

0 votes

I was able to use a "hack" to have a prototype of a workaround.

I am still trying this out, and I don't think this will work on an elastic cluster since you would have to recompile your binaries depending on your cluster's OS architecture. But, if you have a private cluster this may be a solution.

Using Hadoop streaming, you can package your binaries in .jar files and ship them to the nodes, where they will get unpacked before your script runs.

I have my images in pics.jar, and in BinaryProgramFolder.jar I have my program, which processes all images found in the directory it is started from. Inside that folder I have a script which launches the program.

My streaming job ships the images and the binary program plus scripts to the node and starts them. Again, this is a workaround hack, not a "real" solution to the problem.

So,

sudo ./hadoop/bin/hadoop jar ../hduser/hadoop/contrib/streaming/hadoop-streaming-1.1.0.jar \
    -archives 'hdfs://master:54310/user/hduser/pics.jar#pics','hdfs://master:54310/user/hduser/BinaryProgramFolder.jar#BinaryProgramFolder' \
    -numReduceTasks 0 \
    -file /home/hduser/RunHadoopJob.sh \
    -input  /user/hduser/input.txt \
    -output /user/hduser/output \
    -mapper RunHadoopJob.sh  \
    -verbose

Filler input file input.txt:

Filler text for streaming job.

RunHadoopJob.sh

#!/bin/bash
cp -Hr BinaryProgramFolder ./pics; # copy the unpacked program folder (following the symlink created by -archives) into the pics directory
cd ./pics;
./BinaryProgramFolder/BinaryProgramLauncScript.sh <params>; # launch the program by following the symlink to the program folder; I used a script to launch my binary, which sits in the same folder as the launch script

NOTE: you must first put your program and images into jar archives and then copy them to HDFS. Use hadoop fs -copyFromLocal ./<file location> ./<hadoop fs location>
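
A minimal sketch of that preparation step, assuming the images are in a local ./pics directory and the program plus its launch script in a local ./BinaryProgramFolder directory (names taken from the commands above):

jar cf pics.jar -C ./pics .
jar cf BinaryProgramFolder.jar -C ./BinaryProgramFolder .
hadoop fs -copyFromLocal ./pics.jar /user/hduser/pics.jar
hadoop fs -copyFromLocal ./BinaryProgramFolder.jar /user/hduser/BinaryProgramFolder.jar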