I have written some code that performs a self-join task with Hadoop, using the DistributedCache class. When I run the code locally in NetBeans the job completes correctly, but when I try to run it on a single-node cluster, after uploading the data to HDFS, I get the following exception:
Error initializing attempt_201301021509_0002_m_000002_0:
java.io.IOException: Distributed cache entry arrays have different lengths: 1, 2, 1, 1
at org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCacheObjects(JobLocalizer.java:316)
at org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCache(JobLocalizer.java:343)
at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:388)
at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:367)
at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:202)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1228)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1203)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1118)
at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2430)
at java.lang.Thread.run(Thread.java:679)
I understand that the problem lies in JobLocalizer.java and in DistributedCache.getLocalCacheFiles(conf), which returns 2 entries, but I don't know why this happens. Could anyone tell me what I'm missing?
PS: I forgot to mention that I use Hadoop 1.0.4.
PS2: The problem is that DistributedCache.getLocalCacheFiles(conf) sees both the real input file and a temp file that is identical to the input file and is stored temporarily in the /tmp folder. That happens when I run it locally (where no exception is thrown). I guess something similar happens when I run it from HDFS, but there it throws the exception. Does anyone have an idea how I could fix this?
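One workaround I'm considering is to not take the returned array blindly, but to keep only one local path per file name. Here is a minimal sketch of that idea in plain Java (no Hadoop dependencies; the class name `CacheDedup` and the example paths are made up for illustration, not from my actual job):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class CacheDedup {

    // Keep only the first local path seen for each file name,
    // preserving the original order of the entries.
    static List<String> dedupeByFileName(String[] localPaths) {
        LinkedHashSet<String> seenNames = new LinkedHashSet<String>();
        List<String> result = new ArrayList<String>();
        for (String path : localPaths) {
            // compare only the last path component (the file name)
            String name = path.substring(path.lastIndexOf('/') + 1);
            if (seenNames.add(name)) {
                result.add(path);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] paths = {
            "/user/hadoop/cache/input.txt",
            "/tmp/hadoop-tmp/input.txt"   // duplicate temp copy of the same file
        };
        System.out.println(dedupeByFileName(paths));
        // prints [/user/hadoop/cache/input.txt]
    }
}
```

This would hide the duplicate /tmp entry on the local run, but I don't know whether it addresses the root cause of the mismatched-array-lengths check on the cluster, so I'd still like to understand why the extra entry appears at all.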