I am trying to understand data locality as it relates to Hadoop's Map/Reduce framework. In particular I am trying to understand what component handles data locality (i.e. is it the input format?)
Yahoo's Developer Network Page states "The Hadoop framework then schedules these processes in proximity to the location of data/records using knowledge from the distributed file system." This seems to imply that the HDFS input format will perhaps query the name node to determine which nodes contain the desired data and will start the map tasks on those nodes if possible. One could imagine a similar approach could be taken with HBase by querying to determine which regions are serving certain records.
If a developer writes their own input format would they be responsible for implementing data locality?