There are several related questions, but I spent all day trying to figure this one out and the answer wasn't really anywhere on SO, so I'm recording it for posterity.
I have a Hadoop installation (CDH 3u6 - Hadoop 0.20.2) to which I wanted to submit a MapReduce job that had several jar dependencies. As most places recommend, I wanted to use the distributed cache to ship the dependencies to the data nodes:
// add every jar under my/mr/libs (in HDFS) to the task classpath via the distributed cache
Path someHdfsPlace = new Path("my/mr/libs");
FileStatus[] jarFiles = hdfs.listStatus(someHdfsPlace);
for (FileStatus fs : jarFiles) {
    DistributedCache.addFileToClassPath(fs.getPath(), job.getConfiguration());
}
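(In case it's relevant, hdfs and job are set up in the usual way before that loop - roughly like the sketch below; the job name string is just a placeholder:)

Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(conf);   // default filesystem for the cluster, i.e. HDFS
Job job = new Job(conf, "my job");        // old 0.20.x-style Job construction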
I have seen this work on a different Hadoop cluster, but now suddenly it wasn't working. The files existed in HDFS, and the permissions on them and on the directories above them appeared correct, yet any MR code failed with a ClassNotFoundException as soon as it tried to load a dependency from the lib directory (so it wasn't a corruption issue; the jars simply weren't on the classpath).
One post suggested that you have to set the $HADOOP_CLASSPATH variable. That might help in some cases, but it isn't clear to me what you would set it to, and in my previous working setup I hadn't needed to do that anyway, so it seemed unlikely to be the answer.
Utterly mysterious!