
There are several related questions, but I spent all day trying to figure this one out and the answer wasn't really anywhere on SO, so I'm recording it for posterity.

I have a Hadoop installation (CDH 3u6 - Hadoop 0.20.2) to which I wanted to submit a MapReduce job that had several JAR dependencies. As most sources recommend, I wanted to use the distributed cache to ship the dependencies to the data nodes.

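 // hdfs is a FileSystem handle for the cluster; job is the MapReduce Job being configured
 Path classpathFilesDir = new Path("my/mr/libs");
 FileStatus[] jarFiles = hdfs.listStatus(classpathFilesDir);
 for (FileStatus fs : jarFiles) {
      // Add each jar found in the directory to the task class path
      DistributedCache.addFileToClassPath(fs.getPath(), job.getConfiguration());
 }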

I had seen this work on a different Hadoop cluster, but on this one it suddenly did not. The files existed in HDFS, and the permissions on the files and on the directories above them appeared to be correct, but any MR code failed with a ClassNotFound error as soon as it tried to load a dependency from the lib directory (so not a corruption issue; the jars were simply not present on the class path).

One post suggested setting the $HADOOP_CLASSPATH variable, which might help in some cases, but it isn't clear to me what you would set it to. My previous working setup had not needed it anyway, so that seemed unlikely to be the cause.

Utterly mysterious!


1 Answer


For me, at least, the answer is in this bug report:

https://issues.apache.org/jira/browse/MAPREDUCE-1581

The path may arrive as a fully qualified URI, such as hdfs://host:2456/my/mr/libs/myJar.jar. In environments where : is the path separator character, the class path string is split on every colon, which yields the mangled entries hdfs, //host, and 2456/my/mr/libs/myJar.jar. None of these adds the right file to the class path.
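
To make the splitting concrete, here is a minimal sketch in plain Java (no Hadoop required). It assumes the class path entries are split on ":", as on Unix-like systems. The class and method names are made up for illustration and are not Hadoop's actual code.

```java
import java.util.Arrays;
import java.util.List;

public class ClasspathSplitDemo {
    // Assumption: ":" is the path separator, as on Unix-like systems.
    static final String SEPARATOR = ":";

    // Hypothetical helper: splits a class path string on the separator,
    // the way a consumer unaware of URIs would.
    static List<String> splitClasspath(String classpath) {
        return Arrays.asList(classpath.split(SEPARATOR));
    }

    public static void main(String[] args) {
        String qualified = "hdfs://host:2456/my/mr/libs/myJar.jar";
        for (String entry : splitClasspath(qualified)) {
            System.out.println(entry);
        }
        // prints, one per line: hdfs, //host, 2456/my/mr/libs/myJar.jar
    }
}
```

A single fully qualified entry turns into three bogus ones, which matches the symptom: the jar is in HDFS, yet nothing usable ends up on the class path.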

The second solution posted in the bug report worked for me: strip the scheme and authority from the path ("disqualify" it) like so:

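 // hdfs is a FileSystem handle for the cluster; job is the MapReduce Job being configured
 Path classpathFilesDir = new Path("my/mr/libs");
 FileStatus[] jarFiles = hdfs.listStatus(classpathFilesDir);
 for (FileStatus fs : jarFiles) {
      // Keep only the path component, dropping the hdfs://host:port prefix
      Path disqualified = new Path(fs.getPath().toUri().getPath());
      DistributedCache.addFileToClassPath(disqualified, job.getConfiguration());
 }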
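
The effect of the toUri().getPath() call can be checked without a cluster. Hadoop's Path.toUri() returns a java.net.URI, so this sketch uses java.net.URI directly on the example URI from above. The class and method names are hypothetical, chosen only for this illustration.

```java
import java.net.URI;

public class StripSchemeDemo {
    // Hypothetical helper: returns only the path component of a qualified URI,
    // dropping the scheme (hdfs) and the authority (host:2456).
    static String stripSchemeAndAuthority(String qualified) {
        return URI.create(qualified).getPath();
    }

    public static void main(String[] args) {
        String qualified = "hdfs://host:2456/my/mr/libs/myJar.jar";
        String stripped = stripSchemeAndAuthority(qualified);
        System.out.println(stripped);               // prints /my/mr/libs/myJar.jar
        System.out.println(stripped.contains(":")); // prints false
    }
}
```

With no colon left in the entry, splitting the class path on ":" leaves the jar path intact.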