5 votes

I am trying to figure out how to set a classpath that references HDFS, but I cannot find any documentation on it.

 java -cp "how to reference to HDFS?" com.MyProgram 

If I cannot reference the Hadoop file system, then I have to copy all the referenced third-party libs/jars somewhere under $HADOOP_HOME on each Hadoop machine, but I want to avoid this by putting the files on the Hadoop file system. Is this possible?

Example hadoop command line for the program to run (this is what I expect it to look like; maybe I am wrong):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar -input inputfileDir -output outputfileDir -mapper /home/nanshi/myprog.java -reducer NONE -file /home/nanshi/myprog.java

However, within the command line above, how do I add the Java classpath? Something like -cp "/home/nanshi/wiki/Lucene/lib/lucene-core-3.6.0.jar:/home/nanshi/Lucene/bin"


3 Answers

11 votes

What I suppose you are trying to do is include third-party libraries in your distributed program. There are several options:

Option 1) The easiest option I have found is to put all the jars in the $HADOOP_HOME/lib directory (e.g. /usr/local/hadoop-0.22.0/lib) on all nodes and restart your jobtracker and tasktracker.

Option 2) Use the -libjars option; the command for this is hadoop jar <your-jar> -libjars comma_separated_jars

Option 3) Include the jars in the lib directory of your job jar. You will have to do that while creating the jar.

Option 4) Install all the jars on your machines and include their location in the classpath.

Option 5) You can try putting those jars in the distributed cache; a sketch follows below.
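
For option 5, here is a minimal sketch of putting a jar that already lives on HDFS onto the task classpath via the distributed cache, using the Hadoop 1.x API. The class name and helper method are hypothetical, not part of any Hadoop API:

 import java.io.IOException;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.filecache.DistributedCache;
 import org.apache.hadoop.fs.Path;

 public class CacheClasspathSetup {
     // Returns a Configuration that will place the given HDFS jar on the
     // classpath of every map/reduce task of a job submitted with it.
     public static Configuration confWithJar(String hdfsJarPath) throws IOException {
         Configuration conf = new Configuration();
         // hdfsJarPath must already exist on HDFS, e.g. "/libs/lucene-core-3.6.0.jar"
         DistributedCache.addFileToClassPath(new Path(hdfsJarPath), conf);
         return conf;
     }
 }

Pass the returned Configuration (or a JobConf built from it) to your job before submitting it.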

3 votes

You cannot add an HDFS path to your classpath. The java executable wouldn't be able to interpret something like:

hdfs://path/to/your/file

But adding third-party libraries to the classpath of each task that needs them can be done using the -libjars option. This means you need a so-called driver class (implementing Tool) which sets up and starts your job, and you pass the -libjars option on the command line when running that driver class. The Tool, in turn, uses GenericOptionsParser to parse your command-line arguments (including -libjars) and, with the help of the JobClient, does all the necessary work to ship your libs to all the machines that need them and to put them on the classpath of those machines.
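
A minimal sketch of such a driver (class and job names are placeholders, not taken from the question); ToolRunner wires up GenericOptionsParser, so -libjars is stripped from the arguments and handled for you:

 import org.apache.hadoop.conf.Configured;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;

 public class MyDriver extends Configured implements Tool {

     @Override
     public int run(String[] args) throws Exception {
         // By the time run() is called, generic options such as -libjars
         // have already been folded into getConf().
         Job job = new Job(getConf(), "my job");
         job.setJarByClass(MyDriver.class);
         // set mapper, reducer and output types here ...
         FileInputFormat.addInputPath(job, new Path(args[0]));
         FileOutputFormat.setOutputPath(job, new Path(args[1]));
         return job.waitForCompletion(true) ? 0 : 1;
     }

     public static void main(String[] args) throws Exception {
         System.exit(ToolRunner.run(new MyDriver(), args));
     }
 }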

Besides that, in order to run an MR job you should use the hadoop script located in the bin/ directory of your distribution.

Here is an example (using a jar containing your job and the driver class):

 hadoop jar jarfilename.jar DriverClassInTheJar -libjars comma-separated-list-of-libs <input> <output>
2 votes

You can specify the jar path as -libjars hdfs://namenode/path_to_jar. I have used this with Hive.