
Since Hadoop runs over HDFS, and data is replicated across the HDFS cluster for redundancy, does a Hadoop map operation actually waste a lot of processor cycles by running mappers over the same data points on different nodes in the cluster? (The nodes, by design, overlap in the data they hold, as determined by the replication factor.)

Or does the framework, through some job-management strategy, address only a subset of the nodes, so that this kind of duplicate computation is cleverly avoided?


1 Answer


Each mapper gets an individual InputSplit to process, so if you have 100 InputSplits, the framework will spawn 100 mappers. Every mapper then checks whether it already has the data it needs locally; if not, it downloads the required data and begins the computation. An InputSplit is never assigned twice.
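For reference, here is a minimal map-only job sketch against the standard org.apache.hadoop.mapreduce API (the class name SplitCountDemo and the command-line input/output paths are hypothetical). It illustrates the point above: the number of map tasks is determined entirely by the InputSplits that the configured InputFormat returns, not by the HDFS replication factor.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitCountDemo {

    // The framework hands each mapper instance exactly one InputSplit;
    // the replicas of that block on other DataNodes are never processed again.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(offset, line); // just echo each record
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-count-demo");
        job.setJarByClass(SplitCountDemo.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0);                       // map-only job
        job.setInputFormatClass(TextInputFormat.class); // its getSplits() decides the map count
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // hypothetical input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // hypothetical output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Replication actually helps rather than hurts here: the scheduler prefers to place each map task on a node that already holds a replica of that split's block (data locality), so extra replicas give it more placement options instead of causing duplicate work.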