I'm learning Hadoop and have written map/reduce steps to process some Avro files I have. I think the problem I'm hitting may be due to my Hadoop install, though. I'm trying to test in standalone mode on my laptop, not on a distributed cluster.
Here is my bash call to run the job:
#!/bin/bash
reducer=/home/hduser/python-hadoop/test/reducer.py
mapper=/home/hduser/python-hadoop/test/mapper.py
avrohdjar=/home/hduser/python-hadoop/test/avro-mapred-1.7.4-hadoop1.jar
avrojar=/home/hduser/hadoop/share/hadoop/tools/lib/avro-1.7.4.jar
hadoop jar ~/hadoop/share/hadoop/tools/lib/hadoop-streaming* \
    -D mapreduce.job.name="hd1" \
    -libjars ${avrojar},${avrohdjar} \
    -files ${avrojar},${avrohdjar},${mapper},${reducer} \
    -input ~/tmp/data/* \
    -output ~/tmp/data-output \
    -mapper ${mapper} \
    -reducer ${reducer} \
    -inputformat org.apache.avro.mapred.AvroAsTextInputFormat
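For reference, here is the same call with explicit file:// URIs on the local jars and scripts, in case scheme-less paths are being resolved against HDFS. This is just a guess on my part, not something I've confirmed works:

hadoop jar ~/hadoop/share/hadoop/tools/lib/hadoop-streaming* \
    -D mapreduce.job.name="hd1" \
    -libjars file://${avrojar},file://${avrohdjar} \
    -files file://${avrojar},file://${avrohdjar},file://${mapper},file://${reducer} \
    -input ~/tmp/data/* \
    -output ~/tmp/data-output \
    -mapper ${mapper} \
    -reducer ${reducer} \
    -inputformat org.apache.avro.mapred.AvroAsTextInputFormat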
And here's the output:
15/04/23 11:02:54 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/04/23 11:02:54 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/04/23 11:02:54 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/04/23 11:02:54 INFO mapreduce.JobSubmitter: Cleaning up the staging area file:/home/hduser/tmp/mapred/staging/hduser1337717111/.staging/job_local1337717111_0001
15/04/23 11:02:54 ERROR streaming.StreamJob: Error launching job , bad input path : File does not exist: hdfs://localhost:54310/home/hduser/hadoop/share/hadoop/tools/lib/avro-1.7.4.jar
Streaming Command Failed!
I've tried a lot of different fixes but have no idea what to try next. For some reason Hadoop can't find the jar files specified by -libjars. Also, I have successfully run the wordcount example posted here, so my Hadoop install and configuration work well enough for that. Thanks!
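In case it helps with diagnosis, here is a quick sanity check contrasting the local filesystem with HDFS. The error suggests Hadoop is looking for the jar in HDFS, at a path where it presumably only exists locally:

# the jar on the local filesystem
ls -l /home/hduser/hadoop/share/hadoop/tools/lib/avro-1.7.4.jar
# the HDFS path from the error message
hadoop fs -ls hdfs://localhost:54310/home/hduser/hadoop/share/hadoop/tools/lib/avro-1.7.4.jar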
EDIT: Here are the relevant contents of my hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
And here is core-site.xml:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hduser/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
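My (possibly wrong) understanding is that with fs.default.name set to hdfs://localhost:54310, any path without an explicit scheme gets resolved against HDFS, which would explain why the local jar path in the error is being prefixed with hdfs://localhost:54310. If that's the case, one workaround might be to stage the jars in HDFS and point -libjars at them there, roughly like this (the HDFS destination path is just an example):

# stage the jars in HDFS, reusing the shell variables from the script above
hadoop fs -mkdir -p /user/hduser/lib
hadoop fs -put ${avrojar} ${avrohdjar} /user/hduser/lib/
# then in the streaming call:
#   -libjars hdfs://localhost:54310/user/hduser/lib/avro-1.7.4.jar,hdfs://localhost:54310/user/hduser/lib/avro-mapred-1.7.4-hadoop1.jar

Does that sound right, or am I misreading how -libjars resolves paths?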