9
votes

I am trying to add external JARs to the Hadoop classpath, but no luck so far.

I have the following setup:

$ hadoop version
Hadoop 2.0.6-alpha
Subversion https://git-wip-us.apache.org/repos/asf/bigtop.git -r ca4c88898f95aaab3fd85b5e9c194ffd647c2109
Compiled by jenkins on 2013-10-31T07:55Z
From source with checksum 95e88b2a9589fa69d6d5c1dbd48d4e
This command was run using /usr/lib/hadoop/hadoop-common-2.0.6-alpha.jar

Classpath

$ echo $HADOOP_CLASSPATH
/home/tom/workspace/libs/opencsv-2.3.jar

I can see that the above HADOOP_CLASSPATH has been picked up by hadoop:

$ hadoop classpath
/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/home/tom/workspace/libs/opencsv-2.3.jar:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*

Command

$ sudo hadoop jar FlightsByCarrier.jar FlightsByCarrier /user/root/1987.csv /user/root/result

I tried with the -libjars option as well:

$ sudo hadoop jar FlightsByCarrier.jar FlightsByCarrier /user/root/1987.csv /user/root/result -libjars /home/tom/workspace/libs/opencsv-2.3.jar

The stack trace:

14/11/04 16:43:23 INFO mapreduce.Job: Running job: job_1415115532989_0001
14/11/04 16:43:55 INFO mapreduce.Job: Job job_1415115532989_0001 running in uber mode : false
14/11/04 16:43:56 INFO mapreduce.Job: map 0% reduce 0%
14/11/04 16:45:27 INFO mapreduce.Job: map 50% reduce 0%
14/11/04 16:45:27 INFO mapreduce.Job: Task Id : attempt_1415115532989_0001_m_000001_0, Status : FAILED
Error: java.lang.ClassNotFoundException: au.com.bytecode.opencsv.CSVParser
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at FlightsByCarrierMapper.map(FlightsByCarrierMapper.java:19)
    at FlightsByCarrierMapper.map(FlightsByCarrierMapper.java:10)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:757)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:158)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:153)

Any help is highly appreciated.

Comments

Try this: sudo hadoop jar FlightsByCarrier.jar FlightsByCarrier -libjars /home/tom/workspace/libs/opencsv-2.3.jar /user/root/1987.csv /user/root/result – USB

Check my answer here, I have explained all the available options to fix this issue: stackoverflow.com/a/36227260/1766402 – Isaiah4110

5 Answers

3
votes

Your external JAR is missing on the nodes running the map tasks. You have to add it to the distributed cache to make it available. Try:

DistributedCache.addFileToClassPath(new Path("pathToJar"), conf);

I am not sure in which version DistributedCache was deprecated, but from Hadoop 2.2.0 onward you can use:

job.addFileToClassPath(new Path("pathToJar")); 
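
For context, here is a minimal sketch of where that call fits in a driver (the class name and HDFS path are placeholders for illustration, not taken from the question):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class FlightsByCarrierDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "FlightsByCarrier");
            job.setJarByClass(FlightsByCarrierDriver.class);

            // The path must point at a copy of the JAR on HDFS (upload it
            // first, e.g. with `hdfs dfs -put`); a local filesystem path
            // will not be shipped to the task nodes.
            job.addFileToClassPath(new Path("/user/root/libs/opencsv-2.3.jar"));

            // ... set mapper, reducer, input and output paths here ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }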
2
votes

I tried setting the opencsv JAR in the Hadoop classpath, but it didn't work. We need to explicitly copy the JAR into a directory that is on the classpath for this to work. It worked for me. Below are the steps I followed:

I did this on an HDP cluster. I copied my opencsv JAR into the HBase lib directory and exported the classpath before running my JAR.

Copy the external JARs to the HDP lib directories.

To run with the opencsv JAR: 1. Copy the opencsv JAR into the directories /usr/hdp/2.2.9.1-11/hbase/lib/ and /usr/hdp/2.2.9.1-11/hadoop-yarn/lib:

sudo cp /home/sshuser/Amedisys/lib/opencsv-3.7.jar /usr/hdp/2.2.9.1-11/hbase/lib/

2. Give the file permissions using sudo chmod 777 opencsv-3.7.jar. 3. List the files to verify: ls -lrt.

4. Export the Hadoop classpath so that it includes the HBase classpath, e.g. export HADOOP_CLASSPATH=$(hbase classpath).

5. Now run your JAR. It will pick up the opencsv JAR and execute properly.

1
votes

If you are adding the external JAR to the Hadoop classpath, it is better to copy your JAR into one of the existing directories that Hadoop already looks at. On the command line, run "hadoop classpath", find a suitable folder, and copy your JAR file to that location; Hadoop will pick up the dependencies from there. This won't work on Cloudera etc., as you may not have read/write rights to copy files into the Hadoop classpath folders.

It looks like you tried the -libjars option as well; did you edit your driver class to implement the Tool interface? First make sure that your driver class is structured as shown below:

    public class myDriverClass extends Configured implements Tool {

      public static void main(String[] args) throws Exception {
         int res = ToolRunner.run(new Configuration(), new myDriverClass(), args);
         System.exit(res);
      }

      public int run(String[] args) throws Exception
      {

        // Configuration processed by ToolRunner 
        Configuration conf = getConf();
        Job job = new Job(conf, "My Job");

        ...
        ...

        return job.waitForCompletion(true) ? 0 : 1;
      }
    }

Now edit your "hadoop jar" command as shown below:

hadoop jar YourApplication.jar [myDriverClass] -libjars path/to/jar/file args

Note that the -libjars option has to come before the application arguments; GenericOptionsParser only consumes generic options that appear first.

Now let's understand what happens underneath. Basically, we handle the new command-line arguments by implementing the Tool interface. ToolRunner is used to run classes that implement the Tool interface. It works in conjunction with GenericOptionsParser to parse the generic Hadoop command-line arguments and modify the Configuration of the Tool.

Within our main() we call ToolRunner.run(new Configuration(), new myDriverClass(), args); this runs the given Tool via Tool.run(String[]) after parsing the generic arguments. It uses the given Configuration, or builds one if it is null, and then sets the Tool's configuration to the possibly modified version of the conf.

Now, within the run method, calling getConf() returns that modified version of the Configuration. So make sure you have the line below in your code; if you implement everything else but still use Configuration conf = new Configuration(), nothing will work.

Configuration conf = getConf();
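
If you want to see this parsing step in isolation, here is a small standalone sketch (not part of the original answer; the class name is made up for illustration) showing GenericOptionsParser stripping -libjars out of the argument list and recording it in the Configuration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class ParseDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // GenericOptionsParser consumes generic flags such as -libjars
            // and applies them to conf; everything else is left over.
            GenericOptionsParser parser = new GenericOptionsParser(conf, args);
            for (String remaining : parser.getRemainingArgs()) {
                System.out.println("application arg: " + remaining);
            }
            // -libjars is recorded under the "tmpjars" property.
            System.out.println("tmpjars = " + conf.get("tmpjars"));
        }
    }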
0
votes

I found another workaround: implement Tool and run the job through ToolRunner, as below. With this approach Hadoop accepts command-line options, so we can avoid hard-coding the files added to the DistributedCache.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class FlightsByCarrier extends Configured implements Tool {

      public int run(String[] args) throws Exception {
        // Configuration processed by ToolRunner (generic options such as
        // -libjars have already been applied to it)
        Configuration conf = getConf();

        // Create a JobConf using the processed conf
        JobConf job = new JobConf(conf, FlightsByCarrier.class);

        // Process custom command-line options; ToolRunner has already
        // stripped the generic options, so input and output come first
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);

        // Specify various job-specific parameters
        job.setJobName("my-app");
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // Submit the job, then poll for progress until the job is complete
        JobClient.runJob(job);
        return 0;
      }

      public static void main(String[] args) throws Exception {
        // Let ToolRunner handle generic command-line options
        int res = ToolRunner.run(new Configuration(), new FlightsByCarrier(), args);

        System.exit(res);
      }
    }
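
With this in place, the JAR can be passed at submission time instead of being wired into the code, e.g. (mirroring the command from the comments above):

hadoop jar FlightsByCarrier.jar FlightsByCarrier -libjars /home/tom/workspace/libs/opencsv-2.3.jar /user/root/1987.csv /user/root/result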
0
votes

I found a very easy solution to the problem. Log in as root:

cd /usr/lib
find . -name "opencsv*.jar"

Pick up the location of the file. In my case I found it under /usr/lib/hive/lib/opencsv*.jar.

Now run the command:

hadoop classpath

The result shows the directories where Hadoop searches for JAR files. Pick one directory and copy opencsv*.jar to that directory.

In my case it worked.