Data Locality for Hadoop in Cloud Computing

Question

Currently, Hadoop achieves data locality by assigning tasks to a node that holds the data or is near one that does (e.g. in the same rack). However, I wonder whether the same concept can be applied in cloud computing, where Hadoop is deployed on a set of virtual machines, since information about the physical layer (e.g. which physical machines are currently hosting those VMs) may not be available.

jtravaglini · Accepted Answer · 2014-01-21T14:48:40

In most cloud environments, you lose the data locality benefits of Hadoop entirely, because the storage is typically network-attached to your VMs rather than local to the machine that runs the task.

There are some virtualization extensions to Hadoop that let you specify which virtual hosts share the same physical infrastructure (i.e. storage and compute), so that Hadoop can be aware of the underlying hardware. These tend to exist only in 1) on-prem private clouds or (more likely) 2) Hadoop PaaS environments.
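To illustrate the idea of telling Hadoop which VMs share hardware, here is a minimal sketch of a rack-awareness topology script that reports each physical host as if it were a rack. It is not the extensions mentioned above, only the stock topology-script mechanism, and it assumes you already know where each VM is placed, which is usually true only in a private cloud. The IP addresses, host names, and the `VM_TO_PHYSICAL_HOST` table are hypothetical.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop topology script: one 'rack' per physical host."""
import sys

# Hypothetical placement table: VM address -> physical host that runs it.
# In a real deployment this would come from your virtualization layer.
VM_TO_PHYSICAL_HOST = {
    "10.0.0.11": "/phys-host-a",
    "10.0.0.12": "/phys-host-a",
    "10.0.0.21": "/phys-host-b",
}

# Location reported for any VM whose placement is unknown.
DEFAULT_LOCATION = "/default-rack"


def resolve(names):
    """Return one network location per hostname or IP, in the same order."""
    return [VM_TO_PHYSICAL_HOST.get(name, DEFAULT_LOCATION) for name in names]


if __name__ == "__main__":
    # Hadoop passes one or more hostnames/IPs as arguments and reads
    # one location per argument from standard output.
    for location in resolve(sys.argv[1:]):
        print(location)
```

Hadoop would call such a script if its path were set in `net.topology.script.file.name` in `core-site.xml`. The scheduler would then treat VMs on the same physical host as rack-local to each other. This only helps when storage is actually local to that host, and it breaks if VMs migrate without the table being updated.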
