Pig vs. Hive

Apache Pig takes a set of instructions written in Pig Latin, compiles them into a set of MapReduce jobs, and executes those jobs on a Hadoop cluster.

Apache Hive takes a “SQL-like” query as input, compiles it into a set of MapReduce jobs, and executes those jobs on a Hadoop cluster.
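To make "compiles into MapReduce jobs" a little more concrete, here is a minimal Python sketch of the map, shuffle and reduce steps that a simple "count clicks per page" request boils down to. It is only an illustration: the click records and function names are made up for this example, everything runs in memory on one machine, and the actual job plans that Pig and Hive generate are considerably more involved.

```python
from collections import defaultdict

# Hypothetical input: (user, page) click records, invented for this sketch.
records = [
    ("alice", "home"),
    ("bob", "home"),
    ("alice", "cart"),
    ("carol", "home"),
]


def map_phase(rows):
    """Emit one (page, 1) pair per click record."""
    for _user, page in rows:
        yield page, 1


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


page_counts = reduce_phase(shuffle_phase(map_phase(records)))
print(page_counts)
```

With Hive, the same request would be a single GROUP BY query; with Pig, it would be a short sequence of load, group and count steps. In both cases the tool, not the developer, is responsible for producing map and reduce stages along these lines.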

Both Apache Pig and Hive are widely used in Hadoop environments and are must-know tools for any aspiring Hadoop developer. The two descriptions above sound very similar, which raises the following questions:

Why do we have two tools performing somewhat similar operations in the Hadoop ecosystem?
Do Pig and Hive coexist in Hadoop production environments?
Which tool is better: Pig or Hive?

Why do we have two tools performing somewhat similar operations in the Hadoop ecosystem?

Pig and Hive were developed by Yahoo and Facebook, respectively, at around the same time to solve the same problem: making Hadoop easily accessible to non-programmers. In the early stages of development, neither company had a full view of what the other tool could do, which resulted in the overlap.

Do Pig and Hive coexist in Hadoop production environments?

The answer is yes. We have seen successful Hadoop implementations using both Pig and Hive in the same environment.

Here is one such use case: you can use Pig for standard nightly extract, transform and load (ETL) jobs that perform predefined aggregation, data cleanup, filtering and structuring, while developers, data analysts and data scientists use Hive on a day-to-day basis for ad hoc analysis of the data.
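The division of labor in this use case can be sketched in plain Python. This is a stand-in, not Pig or Hive: the step-by-step cleanup below only mimics the style of a Pig Latin script, and an in-memory SQLite database plays the role of the Hive table that analysts query. The log format, table name and column names are all assumptions made up for the example.

```python
import sqlite3

# Hypothetical raw log lines in the form "day,customer,amount".
# Some lines are malformed on purpose.
raw_lines = [
    "2020-01-01,alice,10.0",
    "2020-01-01,bob,5.5",
    "bad line",
    "2020-01-02,alice,4.5",
    "2020-01-02,,7.0",
]

# Nightly ETL, written as a fixed sequence of named steps (Pig-style stand-in).
parsed = [line.split(",") for line in raw_lines]
# Keep only rows with three fields and a non-empty customer.
well_formed = [fields for fields in parsed if len(fields) == 3 and fields[1]]
# Give each column a proper type before loading.
typed = [(day, customer, float(amount)) for day, customer, amount in well_formed]

# Load the cleaned rows into a table (SQLite stands in for a Hive table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (day TEXT, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", typed)

# Ad hoc analysis: an analyst asks a one-off question in SQL (Hive-style stand-in).
query = """
    SELECT customer, SUM(amount)
    FROM purchases
    GROUP BY customer
    ORDER BY customer
"""
totals = conn.execute(query).fetchall()
print(totals)
```

The ETL half is written once and rerun every night, which suits a scripted sequence of steps. The query half changes with every question an analyst asks, which is where a SQL interface tends to be more convenient.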

Which tool is better: Pig or Hive?

There is no straightforward answer. Both tools are equally important and have strong user bases and communities. Both are highly configurable and allow easy integration with custom Java code.

Pig Latin is an easy-to-learn instructional language. Don’t think of it as a whole new programming language to master; its step-by-step instructions are easy to follow and pick up.

Hive has a shorter learning curve because anyone who is familiar with SQL will feel right at home with the tool. Hive also lets developers see the data in rows and columns, which is a great plus.

Big Data In Real World

We are a group of Big Data engineers who are passionate about Big Data and related technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real-time streaming platforms. We have seen a wide range of real-world big data problems and implemented some innovative and complex (or simple, depending on how you look at it) solutions.
