Data Locality in Hadoop

Data Locality in Hadoop refers to the “proximity” of the data with respect to the Mapper tasks that work on it.

Why is Data Locality important?

When a dataset is stored in HDFS, it is divided into blocks and stored across the DataNodes in the Hadoop cluster. When a MapReduce job is executed against the dataset, the individual Mappers process the blocks (Input Splits). When the data is not available to a Mapper on the same node where it is executing, the data has to be copied over the network from the DataNode that holds the data to the DataNode executing the Mapper task.
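
This block-to-node mapping is not hidden; HDFS exposes it through the FileSystem API, and it is exactly this metadata that the framework uses when scheduling Mappers. Here is a minimal sketch that prints which DataNodes hold each block of a file; the path /data/sales.csv is a hypothetical example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical dataset path; replace with a file on your cluster
        Path file = new Path("/data/sales.csv");
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation tells us which DataNodes hold a replica of a block
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```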

Imagine a MapReduce job with over 100 Mappers, each trying to copy data from another DataNode in the cluster at the same time; this would result in serious network congestion, which is far from ideal. So it is always more efficient and cheaper to move the computation closer to the data than to move the data closer to the computation.

How is data proximity defined?

When the JobTracker (MRv1) or ApplicationMaster (MRv2) receives a request to run a job, it looks at which nodes in the cluster have sufficient resources to execute the Mappers and Reducers for the job. At this point, serious consideration is given to deciding on which nodes the individual Mappers will be executed, based on where the data for each Mapper is located.
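
Under the hood, each InputSplit carries a list of hosts where its data lives, and the scheduler consults that list when placing Mappers. The sketch below builds a FileSplit by hand just to show where the location hints live; the file path, the 128 MB length, and the datanode01/02/03 host names are made-up examples (in a real job, FileInputFormat derives all of this from HDFS block metadata).

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitLocations {
    public static void main(String[] args) throws Exception {
        // A split over the first 128 MB block of a hypothetical file,
        // annotated with the DataNodes that hold replicas of that block.
        FileSplit split = new FileSplit(new Path("/data/sales.csv"),
                0, 128 * 1024 * 1024,
                new String[] {"datanode01", "datanode02", "datanode03"});

        // The JobTracker/ApplicationMaster reads these hints via getLocations()
        // and tries to schedule the Mapper on (or near) one of these hosts.
        for (String host : split.getLocations()) {
            System.out.println("preferred host: " + host);
        }
    }
}
```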

[Diagram: Data Locality in Hadoop]

Data Local

When the data is located on the same node as the Mapper working on it, it is referred to as Data Local. In this case the data is as close to the computation as it can get. The JobTracker (MRv1) or ApplicationMaster (MRv2) prefers to execute the Mapper on the node that holds the data the Mapper needs.

Rack Local

Although Data Local is the ideal choice, it is not always possible to execute a Mapper on the same node as its data due to resource constraints on a busy cluster. In such instances it is preferred to run the Mapper on a different node but on the same rack as the node that has the data. In this case, the data is moved within the rack from the node that holds it to the node executing the Mapper.
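
Rack Local scheduling only works if Hadoop knows which rack each node sits on. That rack awareness is not automatic; it is typically configured by pointing Hadoop at a topology script that maps a host name or IP address to a rack path. A minimal sketch of the core-site.xml entry, with a hypothetical script location:

```xml
<!-- core-site.xml: tell Hadoop how to resolve a node to a rack.
     /etc/hadoop/conf/topology.sh is a hypothetical script that reads a
     host name or IP and prints a rack path such as /dc1/rack1. -->
<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.sh</value>
</property>
```

(On older MRv1 clusters the equivalent property was named topology.script.file.name.)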

Different Rack

In a busy cluster sometimes Rack Local is not possible either. In that case, a node on a different rack is chosen to execute the Mapper, and the data is copied across racks from the node that holds it to the node executing the Mapper. This is the least preferred scenario.
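
You can check how well a job did on locality after it finishes: the framework maintains a counter for each of these three cases. A small sketch, assuming job is a completed org.apache.hadoop.mapreduce.Job:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;

public class LocalityReport {
    // Prints how many Mappers of a finished job ran data-local, rack-local,
    // or off-rack (OTHER_LOCAL_MAPS counts maps that were neither).
    public static void printLocality(Job job) throws Exception {
        long dataLocal = job.getCounters()
                .findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
        long rackLocal = job.getCounters()
                .findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();
        long offRack = job.getCounters()
                .findCounter(JobCounter.OTHER_LOCAL_MAPS).getValue();
        System.out.println("Data local maps : " + dataLocal);
        System.out.println("Rack local maps : " + rackLocal);
        System.out.println("Off-rack maps   : " + offRack);
    }
}
```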

 

Hadoop Team
We are a group of Senior Hadoop Consultants who are passionate about Hadoop and Big Data technologies. Our collective experience spans finance, retail, social media, and gaming. We have worked with Hadoop clusters ranging from 100 to over 1000 nodes.
