I have a list of files that I want to add using the distributed cache facility. Different files are needed by different reduce tasks: for example, file A is needed by reduce 1, file B by reduce 2, and so on. In the JobConf, both files are added using the DistributedCache.addCacheFile() method. In the reduce class's configure() method, I use DistributedCache.getCacheFiles() to get the files. Is it possible to have only file A in the memory of reduce 1 and only file B in the memory of reduce 2? Or do both files get loaded into memory before the reduce task starts?
If I understand this correctly, I can use the distributed cache for my program. My concern is scalability: the files are big, so a reduce task cannot hold both files in memory, but it can hold one of them.
Please help!
Thanks
kept/or not in memory as per the requirements. The Hadoop framework copies all the cache files to the HDD on the TaskTracker, and there is a limit of 10 GB based on the local.cache.size. – Praveen Sripati
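To illustrate the point in the comment above (the cache files land on local disk, and what goes into memory is up to the task), here is a minimal sketch of how a reduce task might pick only its own file. It is plain Java with no Hadoop dependency so that it runs on its own. The class name `CacheFileSelector`, the method `selectForPartition`, the example paths, and the rule "file i belongs to reduce i" are all hypothetical. In a real reducer's configure(JobConf), the list would presumably come from DistributedCache.getCacheFiles(conf), and the task's 0-based partition number is, as far as I know, available in the old API as the `mapred.task.partition` property; treat that as an assumption to verify on your Hadoop version.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

public class CacheFileSelector {

    /**
     * Returns the single cache file the reduce task with the given
     * 0-based partition number should load, or null if none is assigned.
     * Assumption (hypothetical): file i in the list belongs to reduce i.
     */
    public static URI selectForPartition(List<URI> cacheFiles, int partition) {
        if (cacheFiles == null || partition < 0 || partition >= cacheFiles.size()) {
            return null;
        }
        return cacheFiles.get(partition);
    }

    public static void main(String[] args) {
        // Hypothetical paths standing in for the files added to the cache.
        List<URI> cacheFiles = Arrays.asList(
                URI.create("hdfs:///cache/fileA"),
                URI.create("hdfs:///cache/fileB"));

        // Each reduce task would read only the file selected for it,
        // leaving the other file untouched on local disk.
        System.out.println(selectForPartition(cacheFiles, 0));
        System.out.println(selectForPartition(cacheFiles, 1));
    }
}
```

With this approach both files are still copied to each node's disk, but each reduce task opens and reads only the one it needs, so only that file occupies its memory.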