My understanding of Apache Hive is that it's a SQL-like tooling layer for querying Hadoop clusters. My understanding of Apache Pig is that it's a procedural language for querying Hadoop clusters. So, if my understanding is correct, Hive and Pig seem like two different ways of solving the same problem.
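For concreteness, this is the kind of thing I picture writing in each (a made-up example of my own; the data and field names are hypothetical, and I may have the details wrong):

```
-- What I imagine a HiveQL query looks like (declarative, SQL-style):
--   SELECT user_id, COUNT(*) FROM page_views GROUP BY user_id;

-- And what I imagine the same thing looks like in Pig Latin (procedural, step by step):
views   = LOAD '/data/page_views' USING PigStorage('\t')
              AS (user_id:chararray, url:chararray);
grouped = GROUP views BY user_id;
counts  = FOREACH grouped GENERATE group AS user_id, COUNT(views) AS n;
STORE counts INTO '/data/view_counts';
```

Both appear to describe the same aggregation, just in different styles, which is why they look to me like two answers to the same question.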
My problem, however, is that I don't understand the problem they are both solving in the first place!
Say we have a DB (relational, NoSQL, doesn't matter) that feeds data into HDFS so that a particular MapReduce job can be run against that input data.
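Roughly, the flow I'm picturing is this (my own sketch; how the data gets from the DB into HDFS isn't the point here):

```
DB (relational or NoSQL)
        |
        |  (some export/ingest step)
        v
HDFS  --  input files stored across the DataNodes
        |
        v
MapReduce job  -->  results written back to HDFS
```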
I'm confused as to which system Hive/Pig are querying! Are they querying the database? Are they querying the raw input data stored in the DataNodes on HDFS? Are they running little ad hoc, on-the-fly MR jobs and reporting their results/outputs?
What is the relationship between these query tools, the MR job input data stored on HDFS, and the MR job itself?