I am new to Hadoop and I am trying to figure out a way to do the following:
- I have multiple input image files.
- I have binary executables that process these files.
- These binary executables write text files as output.
- I have a folder that contains all of these executables.
- I have a script which runs all of these executables in a certain order, passing the image location as an argument.
My question is this: can I use Hadoop Streaming to process these images with these binaries and emit the results from the text files?
I am currently trying this.
I have my Hadoop cluster running. I uploaded my binaries and my images to HDFS.
I have set up a script which, when Hadoop runs it, should change directory into the folder with the images and execute another script, which runs the binaries.
That script then writes the results to stdout.
However, I can't figure out how to have my map script change into the image folder on HDFS and then execute the other script.
Can someone give me a hint?
sudo ./hadoop/bin/hadoop jar ../hduser/hadoop/contrib/streaming/hadoop-streaming-1.1.0.jar \
-numReduceTasks 0 \
-file /home/hduser/RunHadoopJob.sh \
-input /user/hduser/7posLarge \
-output /user/hduser/output5 \
-mapper RunHadoopJob.sh \
-verbose
And my RunHadoopJob.sh:
#!/bin/bash
cd /user/hduser/7posLarge/;
/user/hduser/RunSFM/RunSFM.sh;
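A note on why the `cd` above cannot work as written: `/user/hduser/7posLarge` is an HDFS path, and HDFS is not normally visible as a directory on the task node's local filesystem. One possible workaround is to feed the mapper a list of HDFS image paths on stdin and copy each image to local disk with `hadoop fs -get` before processing it. The following is a minimal sketch under that assumption, not a tested solution for this cluster; the function name `map_images`, the `HADOOP_CMD` variable, and the idea of passing the processing command as arguments are all hypothetical.

```shell
#!/bin/bash
# Sketch of a streaming mapper. Assumptions (not from the original setup):
#   - stdin carries one HDFS image path per line
#   - the processing command is passed as arguments and prints its results to stdout
#   - HADOOP_CMD names the hadoop launcher available on the task node

HADOOP_CMD="${HADOOP_CMD:-hadoop}"

map_images() {
    local hdfs_path local_file workdir
    while IFS= read -r hdfs_path; do
        [ -n "$hdfs_path" ] || continue          # skip blank lines
        workdir="$(mktemp -d)" || return 1       # scratch directory on local disk
        local_file="$workdir/$(basename "$hdfs_path")"
        # Copy the image out of HDFS. stdin is redirected so that the child
        # process cannot consume the remaining input lines of the loop.
        if "$HADOOP_CMD" fs -get "$hdfs_path" "$local_file" </dev/null; then
            # Run the processing command on the local copy; whatever it
            # prints on stdout becomes the mapper output.
            "$@" "$local_file" </dev/null
        else
            echo "failed to fetch $hdfs_path" >&2
        fi
        rm -rf "$workdir"
    done
}

# Only start processing when a processing command was given, so the file
# can also be sourced without side effects.
if [ "$#" -ge 1 ]; then
    map_images "$@"
fi
```

With this approach the job's `-input` would point at a text file on HDFS listing one image path per line rather than at the image folder itself, and the mapper would be given as something like `-mapper "RunHadoopJob.sh ./RunSFM.sh"`, assuming `RunSFM.sh` and its binaries have also been shipped to the node (for example with `-file`) or fetched the same way.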
My HDFS looks like this:
hadoop fs -ls
Warning: $HADOOP_HOME is deprecated.
Found 4 items
drwxr-xr-x - hduser supergroup 0 2012-11-28 17:32 /user/hduser/7posLarge
drwxr-xr-x - hduser supergroup 0 2012-11-28 17:39 /user/hduser/RunSFM
drwxr-xr-x - root supergroup 0 2012-11-30 14:32 /user/hduser/output5
I know this is not the standard use of MapReduce. I am simply looking for an easy way, without much overhead, to spin up multiple jobs of the same program on different clusters with different input. Judging by the Hadoop Streaming documentation, this seems possible:
"How do I use Hadoop Streaming to run an arbitrary set of (semi-)independent tasks?
Often you do not need the full power of Map Reduce, but only need to run multiple instances of the same program - either on different parts of the data, or on the same data, but with different parameters. You can use Hadoop Streaming to do this. "
If this is not possible, is there another tool, on Amazon AWS for example, that can do this for me?
UPDATE: Looks like there are similar solutions but I have trouble following them. They are here and here.