I have written some code that performs a self-join task with Hadoop, using the DistributedCache class. When I run the code locally in NetBeans the job completes correctly, but when I try to run it on a single-node cluster, after uploading the data to HDFS, I get the following exception:
Error initializing attempt_201301021509_0002_m_000002_0:
java.io.IOException: Distributed cache entry arrays have different lengths: 1, 2, 1, 1
at org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCacheObjects(JobLocalizer.java:316)
at org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCache(JobLocalizer.java:343)
at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:388)
at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:367)
at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:202)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1228)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1203)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1118)
at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2430)
at java.lang.Thread.run(Thread.java:679)
I understand that the problem lies in JobLocalizer.java and in DistributedCache.getLocalCacheFiles(conf), which returns 2 entries, but I don't know why this happens. Could anyone tell me what I'm missing?
PS: I forgot to mention that I use Hadoop 1.0.4.
PS2: The problem is that DistributedCache.getLocalCacheFiles(conf) sees both the real input file and a temp file that is identical to the input file and is stored temporarily in the /tmp folder. That happens when I run it locally (where no exception is thrown). I guess something similar happens when I run it from HDFS, but there it throws the exception. Does anyone have an idea how I could fix this?
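One workaround I'm considering is to not take the returned array blindly, but to keep only one local path per file name. Here is a minimal sketch of that idea in plain Java (no Hadoop dependencies; the class name `CacheDedup` and the example paths are made up for illustration, not from my actual job):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class CacheDedup {

    // Keep only the first local path seen for each file name,
    // preserving the original order of the entries.
    static List<String> dedupeByFileName(String[] localPaths) {
        LinkedHashSet<String> seenNames = new LinkedHashSet<String>();
        List<String> result = new ArrayList<String>();
        for (String path : localPaths) {
            // compare only the last path component (the file name)
            String name = path.substring(path.lastIndexOf('/') + 1);
            if (seenNames.add(name)) {
                result.add(path);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] paths = {
            "/user/hadoop/cache/input.txt",
            "/tmp/hadoop-tmp/input.txt"   // duplicate temp copy of the same file
        };
        System.out.println(dedupeByFileName(paths));
        // prints [/user/hadoop/cache/input.txt]
    }
}
```

This would hide the duplicate /tmp entry on the local run, but I don't know whether it addresses the root cause of the mismatched-array-lengths check on the cluster, so I'd still like to understand why the extra entry appears at all.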