Distributed Cache Concept in Hadoop

Question

My question is about the concept of distributed cache specifically for Hadoop and whether it should be called distributed Cache. A conventional definition of distributed cache is "A distributed cache spans multiple servers so that it can grow in size and in transactional capacity".

This is not true in hadoop as Distributed cache is distributed to all the nodes which runs the tasks i.e. the same file mentioned in the driver code.

Shouldn't this be called a replicative cache. The intersection of cache on all nodes should be null (or close to it) if we go by the conventional distributed cache definition. But for hadoop the result of intersection is the same file which is present in all nodes.

Is my understanding correct or am i missing something? Please guide.

Thanks

YoungHobbit YoungHobbit · Accepted Answer · 2015-12-08T04:35:16

The general understanding and concept of any Cache is to make data available in memory and avoid hitting disk for reading the data. Because reading the data from disk is a costlier operation than reading from memory.

Now lets take the same analogy to Hadoop ecosystem. Here disk is your HDFS and memory is local file system where are the actual tasks run. During the life cycle of an application, there may be multiple tasks are running on the same node. So when the first task is launched in the node, it will fetch the data from HDFS and put it in the local system. Now the subsequent tasks on the same node will not fetch the same data again. That way it will save the cost of getting data from HDFS vs getting it from local file system. The is the concept of Distributed Cache in MapReduce framework.

The size of the data is usually small enough that it can be loaded in the Mapper memory, usually in few MBs.

Distributed Cache Concept in Hadoop

2 Answers