Data Locality in Hadoop refers to the “proximity” of the data with respect to the Mapper tasks working on the data.
When a dataset is stored in HDFS, it is divided in to blocks and stored across the DataNodes in the Hadoop cluster. When a MapReduce job is executed against the dataset the individual Mappers will process the blocks (Input Splits). When the data is not available for the Mapper in the same node where it is being executed, the data needs to be copied over the network from the DataNode which has the data to the DataNode which is executing the Mapper task.
Imagine a MapReduce job with over 100 Mappers and each Mapper is trying to copy the data from another DataNode in the cluster at the same time, this would result in serious network congestion as all the Mappers would try to copy the data at the same time and it is not ideal. So it is always effective and cheap to move the computation closer to the data than to move the data closer to the computation.
When a JobTracker (MRv1) or ApplicationMaster (MRv2) receive a request to run a job, it looks at which nodes in the cluster has sufficient resources to execute the Mappers and Reducers for the job. At this point serious consideration is made to decide on which nodes the individual Mappers will be executed based on where the data for the Mapper is located.
When the data is located on the same node as the Mapper working on the data, it is referred to as Data Local. In this case the proximity of the data is closer to the computation. The JobTracker (MRv1) or ApplicationMaster (MRv2) prefers the node which has the data that is needed by the Mapper to execute the Mapper.
Although Data Local is the ideal choice, it is not always possible to execute the Mapper on the same node as the data due to resource constraints on a busy cluster. In such instances it is preferred to run the Mapper on a different node but on the same rack as the node which has the data. In this case, the data will be moved between nodes from the node with the data to the node executing the Mapper with in the same rack.
In a busy cluster sometimes Rack Local is also not possible. In that case, a node on a different rack is chosen to execute the Mapper and the data will be copied from the node which has the data to the node executing the Mapper between racks. This is the least preferred scenario.