0 votes

I have a list of files that I want to add using the distributed cache facility. Different files are needed by different reduce tasks: for example, file A is needed by reducer 1, while file B is needed by reducer 2, and so on. In the JobConf, both files are added with the DistributedCache.addCacheFile() method, and in the reducer's configure() method I use DistributedCache.getCacheFiles() to get them. Is it possible to have only file A in memory for reducer 1 and only file B in memory for reducer 2, or do both files get loaded into memory before the reduce task starts?
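Roughly, my driver looks like this (the file paths and class name are just placeholders):

```java
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MyDriver.class);

        // Both files go into the distributed cache before the job is submitted.
        DistributedCache.addCacheFile(new URI("/user/me/cache/fileA.dat"), conf);
        DistributedCache.addCacheFile(new URI("/user/me/cache/fileB.dat"), conf);

        // ... set input/output paths, mapper and reducer classes ...
        JobClient.runJob(conf);
    }
}
```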

If this works the way I hope, I can use the distributed cache for my program. My concern is scalability: the files are big, so a reduce task cannot hold both files in memory, but it can hold one of them.

Please help!

Thanks

The distributed cache is not in memory; it is just a confusing name for copying files, along with your jar, to every host where the computation runs. – Thomas Jungblut
Thanks for pointing that out. So we can add a file as large as the disk space of the node can hold? – Mahalakshmi Lakshminarayanan
When the reducer processes the file, is it necessary to hold the entire file in memory? – Mahalakshmi Lakshminarayanan
It depends on how the files are processed in the mapper/reducer. The Hadoop framework provides hooks to get the list of files in the cache; the contents of the files can then be read and kept in memory, or not, as required. The framework copies all the cache files to the local disk on the TaskTracker, and there is a limit of 10 GB based on local.cache.size. – Praveen Sripati
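For reference, a minimal sketch of checking the limit the comment above mentions; local.cache.size is read and enforced on the TaskTracker side, so this only reports what the configuration contains (the 10 GB default comes from the comment above):

```java
import org.apache.hadoop.mapred.JobConf;

public class CacheLimitCheck {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // local.cache.size caps the per-node distributed cache, in bytes;
        // 10 GB is the default mentioned above.
        long limitBytes = conf.getLong("local.cache.size", 10L * 1024 * 1024 * 1024);
        System.out.println("Configured distributed-cache limit: " + limitBytes + " bytes");
    }
}
```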

1 Answer

0 votes

The method that returns the cache files (DistributedCache.getCacheFiles()) returns an array of the files you cached, in the order you added them. So it is possible to have reducer 1 read the file at array[0] and reducer 2 read the file at array[1]. It is also recommended not to put very large files in the cache.
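A minimal sketch of how that could look with the old mapred API (class and key/value types are placeholders): each reducer looks up its own partition number and opens only the cache file at that index.

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SelectiveCacheReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    private Path myCacheFile;

    @Override
    public void configure(JobConf job) {
        try {
            // Local copies of the files added with addCacheFile(), in the
            // order they were added in the driver.
            Path[] cached = DistributedCache.getLocalCacheFiles(job);

            // mapred.task.partition holds the 0-based number of this reduce task.
            int partition = job.getInt("mapred.task.partition", 0);

            // Reducer 0 uses the first cached file, reducer 1 the second, etc.
            myCacheFile = cached[partition];
        } catch (IOException e) {
            throw new RuntimeException("Could not resolve distributed cache files", e);
        }
    }

    @Override
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Read myCacheFile here (ideally streaming it rather than loading the
        // whole file into memory), then emit results via output.collect().
    }
}
```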