2 votes

My understanding of Apache Hive is that it's a SQL-like tooling layer for querying Hadoop clusters. My understanding of Apache Pig is that it's a procedural language for querying Hadoop clusters. So, if my understanding is correct, Hive and Pig seem like two different ways of solving the same problem.

My problem, however, is that I don't understand the problem they are both solving in the first place!

Say we have a DB (relational, NoSQL, doesn't matter) that feeds data into HDFS so that a particular MapReduce job can be run against that input data:

[Diagram: a DB feeding input data into HDFS, with a MapReduce job run against that input data]

I'm confused as to which system Hive/Pig are querying! Are they querying the database? Are they querying the raw input data stored in the DataNodes on HDFS? Are they running little ad hoc, on-the-fly MR jobs and reporting their results/outputs?

What is the relationship between these query tools, the MR job input data stored on HDFS, and the MR job itself?


4 Answers

4 votes

Apache Pig and Apache Hive load data from HDFS, unless you run them locally, in which case they load from the local filesystem. How do they get the data from a DB? They don't. You need another framework, such as Sqoop, to export the data from your traditional DB into HDFS.
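A typical Sqoop import might look something like this (a sketch only; the connection string, credentials, table, and target directory are all placeholders):

sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username dbuser \
  --table orders \
  --target-dir /user/hadoop/orders

After this, the orders table sits in HDFS as plain files that Pig and Hive can read.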

Once you have the data in HDFS, you can start working with Pig and Hive. They never query a DB directly. In Apache Pig, for example, you could load your data using a Pig loader:

A = LOAD 'path/in/your/HDFS' USING PigStorage('\t');

As for Hive, you need to create a table and then load the data into the table:

LOAD DATA INPATH 'path/in/your/HDFS/your.csv' INTO TABLE t1;

Again, the data must be in HDFS.
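For completeness, the table itself would be created beforehand, with something like this (a minimal sketch; the column names, types, and delimiter are assumptions matching the .csv above):

CREATE TABLE t1 (id INT, name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';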

As to how it works, it depends. Traditionally they have always worked with the MapReduce execution engine: both Hive and Pig parse the statements you write in HiveQL or Pig Latin and translate them into an execution plan consisting of a certain number of MapReduce jobs, depending on the plan. More recently, they can also translate to Tez, a newer execution engine, which is perhaps still too new to work reliably.
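In Hive, for instance, the execution engine can be switched per session (assuming an installation where Tez is available):

SET hive.execution.engine=tez;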

Why the need for Pig or Hive? Well, you don't really need these frameworks: everything they can do, you can also do by writing your own MapReduce or Tez jobs. However, writing, for instance, a JOIN operation in MapReduce might take hundreds or thousands of lines of code (really), while it is a single line of code in Pig or Hive.
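To make that concrete, here is such a join in each language (a sketch; the relations, tables, and the key field id are assumptions). In Pig:

C = JOIN A BY id, B BY id;

And in Hive:

SELECT * FROM t1 JOIN t2 ON (t1.id = t2.id);

In both cases the framework generates all the map- and reduce-side plumbing for you.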

2 votes

I don't think you can query any data with Hive/Pig without first loading it into them. So you need to load the data first. That data can come from anywhere; you just give the path from which it should be picked up, or add it to them directly. Once the data is in place, queries fetch data only from those tables.

Underneath, they use MapReduce as the tool that does the processing. If you just have ad hoc data lying somewhere and need analysis, you can go directly to MapReduce and define your own logic. Hive sits mostly at the SQL front: you get querying features similar to SQL, and at the backend MapReduce does the job.
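For example, a simple query like the one below (a sketch; the table and column names are assumptions) would be parsed by Hive and executed as one or more MapReduce jobs behind the scenes:

SELECT name, price FROM products WHERE price > 100;

Hope this info helps.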

0 votes

I don't agree that Pig and Hive solve the same problem. Hive is for querying data stored on HDFS as external or internal tables; Pig is for managing a data flow, stored on HDFS, as a directed acyclic graph. These are their main goals, and setting other uses aside, here I want to draw a distinction between:

  • Querying data (the main purpose of Hive), which means getting answers to questions about your data, for example: how many distinct users visited my website per month this year? (Sketched in HiveQL after this list.)
  • Managing a data flow (the main purpose of Pig), which means taking your data from an initial state to a different final state through transformations, for example: data in location A, filtered by criterion c, joined with data in location B, stored in location C. (Sketched in Pig Latin after this list.)
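To illustrate the difference, the two examples above might look like this (minimal sketches; all table, relation, field, and path names are assumptions). The Hive question as a query:

SELECT visit_month, COUNT(DISTINCT user_id)
FROM visits
WHERE visit_year = 2015
GROUP BY visit_month;

And the Pig data flow as a script:

A = LOAD '/data/location_a' USING PigStorage('\t') AS (id:int, status:chararray);
B = LOAD '/data/location_b' USING PigStorage('\t') AS (id:int, info:chararray);
F = FILTER A BY status == 'c';  -- filter by criterion c
J = JOIN F BY id, B BY id;      -- join with data in location B
STORE J INTO '/data/location_c' USING PigStorage('\t');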
0 votes

Smeeb, Pig and Hive do the same thing, I mean processing data, whether it comes in files or whatever format. If you want to process data present in an RDBMS, first get that data into HDFS with the help of Sqoop (SQL + Hadoop).

Hive uses HQL, which is SQL-like, for processing; Pig uses a kind of data flow with the help of Pig Latin. Hive stores all input data in table format, so the first thing to do before loading data into Hive is to create a Hive table; that structure (metadata) will be stored in an RDBMS such as MySQL (the Hive metastore). Then load with:

LOAD DATA INPATH 'path/in/your/HDFS/your.csv' INTO TABLE t1;
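You can inspect that stored metadata from within Hive itself, for example:

DESCRIBE FORMATTED t1;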