0 votes

I am new to Hadoop and I am trying to figure out a way to do the following:

  1. I have multiple input image files.
  2. I have binary executables that process these files.
  3. These binary executables write text files as output.
  4. I have a folder that contains all of these executables.
  5. I have a script which runs all of these executables in a certain order, passing the image locations as arguments.

My question is this: can I use Hadoop Streaming to process these images via these binaries and spit out the results from the text files?

I am currently trying this.

I have my Hadoop cluster running. I uploaded my binaries and my images onto HDFS.

I have set up a script which, when Hadoop runs it, should change directory into the image folder and execute another script which runs the binaries.

The script then spits out the results via stdout.

However, I can't figure out how to have my map script change into the image folder on HDFS and then execute the other script.

Can someone give me a hint?

    sudo ./hadoop/bin/hadoop jar ../hduser/hadoop/contrib/streaming/hadoop-streaming-1.1.0.jar \
        -numReduceTasks 0 \
        -file /home/hduser/RunHadoopJob.sh \
        -input /user/hduser/7posLarge \
        -output /user/hduser/output5 \
        -mapper RunHadoopJob.sh \
        -verbose

And my RunHadoopJob.sh:

#!/bin/bash
cd /user/hduser/7posLarge/;
/user/hduser/RunSFM/RunSFM.sh;

My HDFS looks like this:

hadoop fs -ls
Warning: $HADOOP_HOME is deprecated.

Found 4 items
drwxr-xr-x   - hduser supergroup          0 2012-11-28 17:32 /user/hduser/7posLarge
drwxr-xr-x   - hduser supergroup          0 2012-11-28 17:39 /user/hduser/RunSFM
drwxr-xr-x   - root   supergroup          0 2012-11-30 14:32 /user/hduser/output5

I know this is not the standard use of MapReduce. I am simply looking for a way to spin up multiple jobs of the same program with different inputs across the cluster, without writing much overhead. It seems like this is possible, looking at the Hadoop Streaming documentation:

"How do I use Hadoop Streaming to run an arbitrary set of (semi-)independent tasks?

Often you do not need the full power of Map Reduce, but only need to run multiple instances of the same program - either on different parts of the data, or on the same data, but with different parameters. You can use Hadoop Streaming to do this."

If this is not possible, is there another tool, on Amazon AWS for example, that can do this for me?

UPDATE: Looks like there are similar solutions, but I have trouble following them. They are here and here.

2 Answers

0 votes

There are several issues when dealing with Hadoop streaming and binary files:

  • Hadoop doesn't itself know how to process image files
  • mappers read their input from stdin line by line, so you need to create an intermediate shell script that writes the image data from stdin to a temporary file, which is then passed to the executable (see the sketch after this list).
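
A minimal sketch of such a wrapper mapper, assuming a hypothetical binary ./MyImageTool that takes a file path as its only argument and writes its text results to stdout:

#!/bin/bash
# Dump whatever arrives on stdin into a temporary file, then hand that file to
# the binary; whatever the binary prints to stdout becomes the mapper output.
TMP=$(mktemp image_input.XXXXXX)
cat > "$TMP"
./MyImageTool "$TMP"
rm -f "$TMP"

The wrapper itself would be shipped with -file and named as the -mapper, just like RunHadoopJob.sh in the question.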

Just passing the directory location to the executables is not really efficient, since in that case you lose data locality. I don't want to repeat the already well-answered questions on this topic, so here are the links:
Using Amazon MapReduce/Hadoop for Image Processing
Hadoop: how to access (many) photo images to be processed by map/reduce?

Another approach would be to transform the image files into splittable SequenceFiles, i.e. each record would be one image in the SequenceFile. Using this as the input format, the mappers would call the executables on each record they get. Note that you have to provide the executables to the TaskTracker nodes beforehand, with the correct file permissions, so that they can be invoked from Java code (a sketch of shipping them with the job follows below).
Some more information on this topic:
Hadoop: Example process to generating a SequenceFile with image binaries to be processed in map/reduce
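
A rough, untested sketch of shipping the executable to the worker nodes together with the streaming job; the SequenceFile name images.seq, the output path, and the mapper script image_mapper.sh are assumptions, and the generic -files option has to come before the streaming-specific options:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.1.0.jar \
    -files hdfs://master:54310/user/hduser/RunSFM/RunSFM.sh#RunSFM.sh \
    -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat \
    -input /user/hduser/images.seq \
    -output /user/hduser/seq_output \
    -file /home/hduser/image_mapper.sh \
    -mapper image_mapper.sh \
    -numReduceTasks 0

If the executable bit is not preserved on the node, the mapper script can run chmod +x ./RunSFM.sh before invoking it.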

0 votes

I was able to use a "hack" to have a prototype of a workaround.

I am still trying this out, and I don't think this will work on an elastic cluster since you would have to recompile your binaries depending on your cluster's OS architecture. But, if you have a private cluster this may be a solution.

Using Hadoop streaming, you can package your binaries in .jar files and ship them to the nodes, where they will get unpacked before your script runs.

I have my images in pics.jar, and in BinaryProgramFolder.jar I have my program, which processes all images found in the directory it is started from. Inside that folder I have a script which launches the program.

My streaming job ships the images and the binary program plus scripts to the node and starts them. Again, this is a workaround hack, not a "real" solution to the problem.

So,

sudo ./hadoop/bin/hadoop jar ../hduser/hadoop/contrib/streaming/hadoop-streaming-1.1.0.jar \
    -archives 'hdfs://master:54310/user/hduser/pics.jar#pics','hdfs://master:54310/user/hduser/BinaryProgramFolder.jar#BinaryProgramFolder' \
    -numReduceTasks 0 \
    -file /home/hduser/RunHadoopJob.sh \
    -input  /user/hduser/input.txt \
    -output /user/hduser/output \
    -mapper RunHadoopJob.sh  \
    -verbose

Filler input file input.txt:

Filler text for streaming job.

RunHadoopJob.sh

#!/bin/bash
cp -Hr BinaryProgramFolder ./pics; # copy the unpacked program folder (following the symlink created by -archives) into the pics directory
cd ./pics;
./BinaryProgramFolder/BinaryProgramLauncScript.sh <params>; # launch the program by following the symlink to the program folder; I used a script to launch my binary, which sits in the same folder as the launch script

NOTE: you must first put your program and images into jar archives and then copy them to HDFS. Use hadoop fs -copyFromLocal ./<file location> ./<hadoop fs location>
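
A minimal sketch of that preparation step, assuming the images are in a local ./pics directory and the program plus its launch script in a local ./BinaryProgramFolder directory (names taken from the commands above):

jar cf pics.jar -C ./pics .
jar cf BinaryProgramFolder.jar -C ./BinaryProgramFolder .
hadoop fs -copyFromLocal ./pics.jar /user/hduser/pics.jar
hadoop fs -copyFromLocal ./BinaryProgramFolder.jar /user/hduser/BinaryProgramFolder.jar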