
InputSplit vs Block


The central idea behind MapReduce is distributed processing, and hence the most important thing is to divide the dataset into chunks and have a separate process work on every chunk of data.

Let's assign some technical jargon now. The chunks are called InputSplits and the processes working on the chunks (InputSplits) are called Mappers.
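To make that concrete, here is a minimal word-count Mapper sketch (the class name and tokenizing logic are illustrative, not from this post). The framework creates one mapper per InputSplit and calls map() once for every record in that split.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One instance of this class processes the records of one InputSplit.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of this line in the file; value = the line itself
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}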

Are InputSplits the Same as Blocks?

An InputSplit is not the same as a block.

A block is a hard division of data at the block size. So if the block size in the cluster is 128 MB, each block of the dataset will be 128 MB, except for the last block, which could be smaller if the file size is not evenly divisible by the block size. A block is a hard cut at the block size, so a block can end even before a logical record ends.

Consider that the block size in your cluster is 128 MB and each logical record in your file is about 100 MB. (Yes, huge records.)

The first record will fit in block 1 with no problem, since the record size of 100 MB is well within the block size of 128 MB. However, the second record cannot fit entirely in block 1, so record 2 will start in block 1 and end in block 2.

If you assign a mapper to block 1 in this case, the mapper cannot process record 2 because block 1 does not have the complete record 2. That is exactly the problem InputSplit solves. Here, InputSplit 1 will have both record 1 and record 2. InputSplit 2 does not start with record 2, since record 2 is already included in InputSplit 1, so InputSplit 2 will have only record 3. Record 3 is divided between blocks 2 and 3, but InputSplit 2 will still have the whole of record 3, as the short sketch below walks through.
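Here is a quick, self-contained sketch of that arithmetic using the numbers above (128 MB blocks, 100 MB records); the class name is just for illustration:

// Prints which block each record starts and ends in, showing that
// records 2 and 3 straddle block boundaries.
public class BlockBoundaryDemo {
    public static void main(String[] args) {
        final long BLOCK_SIZE = 128L * 1024 * 1024;  // 128 MB
        final long RECORD_SIZE = 100L * 1024 * 1024; // 100 MB
        for (int record = 1; record <= 3; record++) {
            long start = (record - 1) * RECORD_SIZE; // first byte of the record
            long end = record * RECORD_SIZE - 1;     // last byte of the record
            System.out.printf("Record %d: starts in block %d, ends in block %d%n",
                    record, start / BLOCK_SIZE + 1, end / BLOCK_SIZE + 1);
        }
    }
}

Running it confirms the prose: record 1 sits entirely in block 1, record 2 starts in block 1 and ends in block 2, and record 3 starts in block 2 and ends in block 3.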

Blocks are physical chunks of data stored on disk, whereas an InputSplit is not a physical chunk of data. It is a Java class with pointers to the start and end locations within blocks, so when a mapper reads the data, it knows exactly where to start reading and where to stop. An InputSplit can start in one block and end in another.
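For file-based inputs the concrete class is FileSplit, and you can inspect it from inside a mapper. A small sketch, assuming the newer org.apache.hadoop.mapreduce API (this fragment would sit inside a map() or setup() method):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Inside a Mapper's map() or setup() method:
FileSplit split = (FileSplit) context.getInputSplit();
Path file = split.getPath();           // the file this split belongs to
long start = split.getStart();         // byte offset where the split begins
long length = split.getLength();       // number of bytes in the split
String[] hosts = split.getLocations(); // hosts holding the underlying blocks

// getStart() and getLength() are byte offsets into the file, not block
// boundaries, which is why a split can begin in one block and end in another.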

InputSplits respect logical record boundaries, and that is why they are so important. During MapReduce execution, Hadoop scans through the blocks, creates InputSplits, and assigns each InputSplit to an individual mapper for processing.
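By default, FileInputFormat makes each InputSplit line up with one block: the split size works out to max(minSplitSize, min(maxSplitSize, blockSize)). If you want fewer or more mappers, you can nudge those bounds. A sketch (the 256 MB and 512 MB values are arbitrary examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

Job job = Job.getInstance(new Configuration(), "split size demo");
// Raising the minimum forces splits larger than a block (fewer mappers);
// lowering the maximum forces splits smaller than a block (more mappers).
FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024); // 256 MB
FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024); // 512 MB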
