2 votes

I'm getting an error that I think has to do with how I set up my directories.

After running:

hadoop-0.20.205.0/bin/hadoop jar hadoop-0.20.205.0/contrib/streaming/hadoop-streaming-*.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input cs4501input -output py_wc_out

I get: packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-ubuntu/hadoop-unjar6120166906857088018/] [] /tmp/streamjob1341652915014758694.jar tmpDir=null

12/04/08 01:34:01 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:9000/tmp/hadoop-ubuntu/mapred/staging/ubuntu/.staging/job_201204080100_0004

12/04/08 01:34:01 ERROR streaming.StreamJob: Error launching job , Output path already exists : Output directory hdfs://localhost:9000/user/ubuntu/py_wc_out already exists

Streaming Job Failed!

I think it has to do with how I specified hdfs in the core-site.xml file, but I took that straight from the quick start guide. I don't understand why I need to put hdfs:// in front of the localhost address and port number.
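For reference, here's the relevant property from my conf/core-site.xml, copied from the quick start guide (a minimal sketch; localhost:9000 is the guide's default):

<configuration>
  <property>
    <!-- fs.default.name makes hdfs://localhost:9000 the default filesystem -->
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>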


1 Answer

4 votes

The problem is that you're trying to run the same job without cleaning out your output directory. Delete the output directory first, then rerun the job; you'll have to do this between every run. Hadoop deliberately fails rather than letting you overwrite an existing output directory, so a rerun can't silently clobber the results of a previous job.

hadoop fs -rmr /user/ubuntu/py_wc_out
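If you end up doing this a lot, you can chain the cleanup and the resubmission in one shot. A sketch using the same paths as your command (note that -rmr is the 0.20-era syntax; newer releases spell it -rm -r):

# remove the previous run's output, then resubmit the streaming job
hadoop-0.20.205.0/bin/hadoop fs -rmr /user/ubuntu/py_wc_out
hadoop-0.20.205.0/bin/hadoop jar hadoop-0.20.205.0/contrib/streaming/hadoop-streaming-*.jar \
  -file mapper.py -mapper mapper.py \
  -file reducer.py -reducer reducer.py \
  -input cs4501input -output py_wc_out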

Personally, the way I like to get around this "problem" is to attach a timestamp to the output directory on the fly. That way the name is always unique and you don't have to clean up previous runs.

hadoop-0.20.205.0/bin/hadoop jar ... -output py_wc_out-`date +%s`
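If you'll need the timestamped name again later (say, to pull the results out of HDFS), generate it once and keep it in a shell variable. A small sketch along the same lines:

# build the unique output name up front so later steps can find it
OUT=py_wc_out-$(date +%s)
hadoop-0.20.205.0/bin/hadoop jar hadoop-0.20.205.0/contrib/streaming/hadoop-streaming-*.jar \
  -file mapper.py -mapper mapper.py \
  -file reducer.py -reducer reducer.py \
  -input cs4501input -output "$OUT"
# print the reducer output for that run
hadoop-0.20.205.0/bin/hadoop fs -cat "$OUT/part-*"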