2 votes

My understanding of Apache Hive is that it's a SQL-like tooling layer for querying Hadoop clusters. My understanding of Apache Pig is that it's a procedural language for querying Hadoop clusters. So, if my understanding is correct, Hive and Pig seem like two different ways of solving the same problem.

My problem, however, is that I don't understand the problem they are both solving in the first place!

Say we have a DB (relational, NoSQL, doesn't matter) that feeds data into HDFS so that a particular MapReduce job can be run against that input data:

[Diagram: a DB feeding input data into HDFS, with a MapReduce job run against that input data]

I'm confused as to which system Hive/Pig are querying! Are they querying the database? Are they querying the raw input data stored in the DataNodes on HDFS? Are they running little ad hoc, on-the-fly MR jobs and reporting their results/outputs?

What is the relationship between these query tools, the MR job input data stored on HDFS, and the MR job itself?


4 Answers

4 votes

Apache Pig and Apache Hive load data from HDFS, unless you run them locally, in which case they load from the local filesystem. How do they get the data from a DB? They don't. You need another framework, such as Sqoop, to export the data from your traditional DB into HDFS.
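A typical Sqoop import might look something like this (a sketch only; the connection string, credentials, table, and target directory are all placeholders):

sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username dbuser \
  --table orders \
  --target-dir /user/hadoop/orders

After this, the orders table sits in HDFS as plain files that Pig and Hive can read.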

Once you have the data in HDFS, you can start working with Pig and Hive. They never query a DB directly. In Apache Pig, for example, you could load your data using a Pig loader:

A = LOAD 'path/in/your/HDFS' USING PigStorage('\t');

As for Hive, you need to create a table and then load the data into the table:

LOAD DATA INPATH 'path/in/your/HDFS/your.csv' INTO TABLE t1;

Again, the data must be in HDFS.
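For completeness, the table itself would be created beforehand, with something like this (a minimal sketch; the column names, types, and delimiter are assumptions matching the .csv above):

CREATE TABLE t1 (id INT, name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';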

As to how it works, it depends. Traditionally they have always worked with the MapReduce execution engine: both Hive and Pig parse the statements you write in HiveQL or Pig Latin and translate them into an execution plan consisting of a certain number of MapReduce jobs, depending on the plan. More recently, they can also translate to Tez, a newer execution engine, which is perhaps still too new to work reliably.
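In Hive, for instance, the execution engine can be switched per session (assuming an installation where Tez is available):

SET hive.execution.engine=tez;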

Why the need for Pig or Hive? Well, you don't really need these frameworks: everything they can do, you can also do by writing your own MapReduce or Tez jobs. However, writing, for instance, a JOIN operation in MapReduce might take hundreds or thousands of lines of code (really), while it is a single line of code in Pig or Hive.
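To make that concrete, here is such a join in each language (a sketch; the relations, tables, and the key field id are assumptions). In Pig:

C = JOIN A BY id, B BY id;

And in Hive:

SELECT * FROM t1 JOIN t2 ON (t1.id = t2.id);

In both cases the framework generates all the map- and reduce-side plumbing for you.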

2 votes

I don't think you can query any data with Hive/Pig without first loading it into them. So you need to load the data first. That data can come from anywhere; you just give the path from which it should be picked up, or add it to them directly. Once the data is in place, queries fetch data only from those tables.

Underneath, they use MapReduce as the tool that does the processing. If you just have ad hoc data lying somewhere and need analysis, you can go directly to MapReduce and define your own logic. Hive sits mostly at the SQL front: you get querying features similar to SQL, and at the backend MapReduce does the job.
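For example, a simple query like the one below (a sketch; the table and column names are assumptions) would be parsed by Hive and executed as one or more MapReduce jobs behind the scenes:

SELECT name, price FROM products WHERE price > 100;

Hope this info helps.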

0 votes

I don't agree that Pig and Hive solve the same problem. Hive is for querying data stored on HDFS as external or internal tables; Pig is for managing a data flow, stored on HDFS, as a directed acyclic graph. These are their main goals, and setting other uses aside, here I want to draw a distinction between:

  • Querying data (the main purpose of Hive), which means getting answers to questions about your data, for example: how many distinct users visited my website per month this year? (Sketched in HiveQL after this list.)
  • Managing a data flow (the main purpose of Pig), which means taking your data from an initial state to a different final state through transformations, for example: data in location A, filtered by criterion c, joined with data in location B, stored in location C. (Sketched in Pig Latin after this list.)
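To illustrate the difference, the two examples above might look like this (minimal sketches; all table, relation, field, and path names are assumptions). The Hive question as a query:

SELECT visit_month, COUNT(DISTINCT user_id)
FROM visits
WHERE visit_year = 2015
GROUP BY visit_month;

And the Pig data flow as a script:

A = LOAD '/data/location_a' USING PigStorage('\t') AS (id:int, status:chararray);
B = LOAD '/data/location_b' USING PigStorage('\t') AS (id:int, info:chararray);
F = FILTER A BY status == 'c';  -- filter by criterion c
J = JOIN F BY id, B BY id;      -- join with data in location B
STORE J INTO '/data/location_c' USING PigStorage('\t');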
0 votes

Smeeb, Pig and Hive do the same thing, I mean processing data, whether it comes in files or whatever format. If you want to process data present in an RDBMS, first get that data into HDFS with the help of Sqoop (SQL + Hadoop).

Hive uses HQL, which is SQL-like, for processing; Pig uses a kind of data flow with the help of Pig Latin. Hive stores all input data in table format, so the first thing to do before loading data into Hive is to create a Hive table; that structure (metadata) will be stored in an RDBMS such as MySQL (the Hive metastore). Then load with:

LOAD DATA INPATH 'path/in/your/HDFS/your.csv' INTO TABLE t1;
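You can inspect that stored metadata from within Hive itself, for example:

DESCRIBE FORMATTED t1;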