2
votes

I added a set of jars to the Distributed Cache with the DistributedCache.addFileToClassPath(Path file, Configuration conf) method, to make the dependencies available to a MapReduce job across the cluster. Now I would like to remove all of those jars from the cache so I can start clean and be sure the right jar versions are there. I commented out the code that adds the files to the cache and also deleted the jars from the HDFS location I had copied them to. The problem is that the jars still appear to be on the classpath, because the MapReduce job is not throwing a ClassNotFoundException. Is there a way to flush this cache without restarting any services?

Edit: I subsequently emptied the following folder: /var/lib/hadoop-hdfs/cache/mapred/mapred/local/taskTracker/distcache/. That did not solve it. The job still finds the classes.

1 Answer

2
votes

I now understand what my problem was. I had previously copied the jars into the /usr/lib/hadoop/lib/ folder, which made them permanently available to every MapReduce job. After removing them from there, the job threw the expected ClassNotFoundException. I also noticed that if I do not add the jars with addFileToClassPath, they are not available to the job. So there is no need to flush the Distributed Cache or to remove what you added with addFileToClassPath, because what you put there is visible only to that specific job instance.
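For anyone hitting the same thing, one way to see where a class is really coming from is to ask the JVM for its code source. The snippet below is only a rough sketch using plain JDK calls; the class name ClassOrigin and the method locate are made up for illustration and are not part of Hadoop. Calling something like it from a mapper's setup and logging the result should show whether a class is resolved from a distcache path or from a directory such as /usr/lib/hadoop/lib/.

```java
import java.security.CodeSource;

// Hypothetical helper (not a Hadoop API): reports which jar or directory
// a class was loaded from, using only standard JDK calls.
public class ClassOrigin {

    // Returns the code source location of the named class, or a short note
    // if the class is missing or has no code source (JDK core classes).
    static String locate(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassOrigin.class.getClassLoader();
        }
        try {
            // initialize=false: load the class without running its static initializers
            Class<?> cls = Class.forName(className, false, loader);
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            return src == null
                    ? "bootstrap/platform class loader (no code source)"
                    : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException e) {
            return "not on classpath";
        }
    }

    public static void main(String[] args) {
        // Pass fully qualified class names as arguments.
        for (String name : args) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

If the printed location points at a system-wide lib folder rather than a distcache directory, the class is not coming from the Distributed Cache at all, which is what happened in my case.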