Can Reducer always be reused for Combiner? - Big Data In Real World

A combiner is an optional intermediary function that runs on the map side, on the output of the mapper, before that output is sent to the reducers. There are two primary benefits to using a combiner:

  1. Combiners reduce the amount of data sent over the network to the reducers, which improves network efficiency.
  2. Because less data reaches the reduce side, each reduce call has fewer records to process, which improves efficiency on the reduce side.

The signature of the combiner is the same as that of the reducer, since both process the output of the mapper. This gives us a great opportunity to reuse the reducer program as the combiner.

But is it always a good idea to reuse the reducer program as the combiner?

Reducer for Combiner – Good use case

Let’s say we are writing a MapReduce program to calculate the maximum closing price for each symbol from a stocks dataset. The mapper emits the symbol as the key and the closing price as the value for each stock record in the dataset. The reducer is called once for each stock symbol with a list of closing prices. It loops through all the closing prices for the symbol and calculates the maximum closing price for that symbol.

Assume Mapper 1 processed 3 records for symbol ABC with closing prices – 50, 60 and 111. Let’s also assume that Mapper 2 processed 2 records for symbol ABC with closing prices – 100 and 31.

Now the reducer will receive five closing prices for symbol ABC: 50, 60, 111, 100 and 31. The job of the reducer is very simple: it loops through all 5 closing prices and calculates the maximum closing price to be 111.

We can use the same reducer program as the combiner after each mapper. The combiner on Mapper 1 will process 3 closing prices (50, 60 and 111) and will emit only 111, the maximum of the 3 values. The combiner on Mapper 2 will process 2 closing prices (100 and 31) and will emit only 100, the maximum of the 2 values.

Now, with the combiner in place, the reducer will only process 2 closing prices for symbol ABC, 111 from Mapper 1 and 100 from Mapper 2, and will calculate the maximum closing price as 111.

As we can see, the output is the same with and without the combiner, because the maximum of the per-mapper maximums is the overall maximum. In this case, reusing the reducer as a combiner works with no issues.
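The good case can be sketched in a few lines of Python. This is only a local simulation under simplifying assumptions, not Hadoop code: `run_job` and `max_reducer` are hypothetical names made up for this illustration, and the combiner is assumed to run exactly once per mapper.

```python
from collections import defaultdict

def max_reducer(symbol, prices):
    """Return the maximum closing price seen for one symbol."""
    return max(prices)

def run_job(mapper_outputs, reducer, combiner=None):
    """Simulate combine, shuffle and reduce over per-mapper (key, value) lists.

    Returns (results, record_count), where record_count is the number of
    records that would be sent to the reducer.
    """
    shuffled = defaultdict(list)
    for pairs in mapper_outputs:
        # Group this mapper's output by key.
        local = defaultdict(list)
        for key, value in pairs:
            local[key].append(value)
        for key, values in local.items():
            if combiner is not None:
                # The combiner collapses this mapper's values before the shuffle.
                values = [combiner(key, values)]
            shuffled[key].extend(values)
    results = {key: reducer(key, values) for key, values in shuffled.items()}
    record_count = sum(len(values) for values in shuffled.values())
    return results, record_count

mapper_1 = [("ABC", 50), ("ABC", 60), ("ABC", 111)]
mapper_2 = [("ABC", 100), ("ABC", 31)]

without_combiner = run_job([mapper_1, mapper_2], max_reducer)
with_combiner = run_job([mapper_1, mapper_2], max_reducer, combiner=max_reducer)

print(without_combiner)  # ({'ABC': 111}, 5)
print(with_combiner)     # ({'ABC': 111}, 2)
```

Both runs report 111 for ABC, but the reducer receives 2 records instead of 5 when the combiner is used.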

 

Reducer for Combiner – Bad use case

Let’s say we are writing a MapReduce program to calculate the average volume for each symbol from a stocks dataset. The mapper emits the symbol as the key and the volume as the value for each stock record in the dataset. The reducer is called once for each stock symbol with a list of volumes. It loops through all the volumes for the symbol and calculates the average volume for that symbol.

Assume Mapper 1 processed 3 records for symbol ABC with volumes – 50, 60 and 111. Let’s also assume that Mapper 2 processed 2 records for symbol ABC with volumes – 100 and 31.

Now the reducer will receive five volume values for symbol ABC: 50, 60, 111, 100 and 31. The job of the reducer is very simple: it loops through all 5 volumes and calculates the average volume to be 70.4.

(50 + 60 + 111 + 100 + 31) / 5 = 352 / 5 = 70.4

Let’s see what happens if we use the same reducer program as the combiner after each mapper. The combiner on Mapper 1 will process 3 volumes (50, 60 and 111) and will calculate their average as 221 / 3 ≈ 73.67.

The combiner on Mapper 2 will process 2 volumes (100 and 31) and will calculate their average as 131 / 2 = 65.5.

Now, with the combiner in place, the reducer will only process 2 average volumes for symbol ABC, 73.67 from Mapper 1 and 65.5 from Mapper 2, and will calculate the average volume of symbol ABC as (73.67 + 65.5) / 2 ≈ 69.58, which is incorrect, as the correct average volume is 70.4. The average of averages gives each mapper's result equal weight, even though Mapper 1 saw 3 records and Mapper 2 saw only 2.
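The arithmetic of the bad case is easy to check with a short Python sketch. It uses only the standard library's `statistics.mean` and the volumes from the example. The variable names are made up for this illustration.

```python
from statistics import mean

mapper_1_volumes = [50, 60, 111]
mapper_2_volumes = [100, 31]

# Reducer only: one average over all five volumes.
correct = mean(mapper_1_volumes + mapper_2_volumes)

# Reducer reused as combiner: each mapper averages locally first,
# then the reducer averages those two averages.
combined = mean([mean(mapper_1_volumes), mean(mapper_2_volumes)])

print(round(correct, 2))   # 70.4
print(round(combined, 2))  # 69.58
```

The two results differ because the second calculation no longer knows how many records stood behind each partial average.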

So, as we can see, the reducer cannot always be reused as the combiner. Whenever you decide to reuse the reducer as the combiner, ask yourself this question: will my output be the same with and without the combiner?
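One rough way to ask that question in code is to compare the two paths on sample data. The sketch below is an assumption-laden illustration, not part of any Hadoop API: `same_with_and_without_combiner`, `average`, `sum_count_combiner` and `average_reducer` are all hypothetical helpers. The second half shows a commonly used workaround for averages that the discussion above does not cover: a separate combiner that emits partial (sum, count) pairs instead of reusing the reducer.

```python
def same_with_and_without_combiner(reduce_fn, splits):
    """Compare reducing all values at once with reducing per-split results.

    `splits` is a list of value lists, one per mapper.
    """
    all_values = [value for split in splits for value in split]
    direct = reduce_fn(all_values)
    via_combiner = reduce_fn([reduce_fn(split) for split in splits])
    return direct == via_combiner

def average(values):
    return sum(values) / len(values)

splits = [[50, 60, 111], [100, 31]]

print(same_with_and_without_combiner(max, splits))      # True
print(same_with_and_without_combiner(average, splits))  # False

def sum_count_combiner(volumes):
    """Separate combiner: emit a partial (sum, count) pair."""
    return (sum(volumes), len(volumes))

def average_reducer(partials):
    """Merge (sum, count) pairs into one overall average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

partials = [sum_count_combiner(split) for split in splits]
print(average_reducer(partials))  # 70.4
```

A passing check on one sample is not proof. For example, the average would pass if every mapper happened to see the same number of records. Hadoop may also run the combiner zero, one or several times, so in a real job the mapper would have to emit (volume, 1) pairs and the combiner would have to merge (sum, count) pairs, so that the reducer always receives the same record shape.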

Big Data In Real World
We are a group of Big Data engineers who are passionate about Big Data and related technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real-time streaming platforms. We have seen a wide range of real-world big data problems and implemented some innovative and complex (or simple, depending on how you look at it) solutions.
