1 vote

What is the control flow of a Hive query?

Let's say I would like to join Emp_Table with Dept_Table.

How does the flow go?

From which tables in the metastore does it fetch all the relevant information?

Such as:

1) Where is the file that corresponds to Emp_Table (its HDFS location)?
2) What are the names of the fields of Emp_Table?
3) What is the delimiter in the file that contains the data of Emp_Table?
4) If the data is bucketed or partitioned, from which metastore table and with what query does Hive get the HDFS folder locations?


2 Answers

6 votes

The flow goes like this:

Step 1: A Hive client submits a query (from the CLI, the web UI, or an external client using JDBC, ODBC or Thrift).

Step 2: The compiler receives the query and connects to the metastore.

Step 3: Start of the compilation phase.

Parser

Converts the query into a parse tree representation. ANTLR is used to generate the abstract syntax tree (AST).

Semantic analyzer

The compiler builds a logical plan based on the information the metastore provides about the input and output tables. The compiler also checks type compatibilities and flags compile-time semantic errors at this stage.
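For example, a query that references a column the metastore does not know about is rejected here at compile time, before any MapReduce job is launched (the column name below is made up):

-- Assuming Emp_Table has no column named salary_grade, Hive rejects this
-- during semantic analysis with a SemanticException
SELECT salary_grade FROM Emp_Table;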

QB tree creation

In this step the AST is transformed into an intermediate representation called a query block (QB) tree.

Logical plan generator

At this step the compiler writes the logical plan from the semantic analyzer into a logical tree of operations.
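You can get a rough look at this tree of operators with EXPLAIN; a minimal sketch, assuming hypothetical emp_name/dept_name/dept_id columns (the exact output differs between Hive versions):

-- EXPLAIN prints the plan without running the query: a set of stages and,
-- inside each stage, operators such as TableScan, Join Operator,
-- Select Operator and File Output Operator
EXPLAIN
SELECT e.emp_name, d.dept_name
FROM Emp_Table e
JOIN Dept_Table d ON e.dept_id = d.dept_id;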

Optimization

This is the heaviest part of the compilation phase, as the entire series of DAG optimizations takes place here. It involves the following tasks:

Logical optimization

Column pruning

Predicate pushdown

Partition pruning (see the sketch after this list)

Join optimization

Grouping(and regrouping)

Repartitioning

Conversion of the logical plan into a physical plan by the physical plan generator

Creation of the final DAG workflow of MapReduce jobs by the physical plan generator
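A minimal sketch of what partition pruning and predicate pushdown buy you, assuming (hypothetically) that Emp_Table is partitioned by dept_id:

-- Because dept_id is (by assumption) a partition column, the filter below
-- lets the optimizer read only the dept_id=10 directory instead of the
-- whole table, and the emp_name predicate is pushed down close to the
-- table scan so fewer rows reach the join
SELECT e.emp_name, d.dept_name
FROM Emp_Table e
JOIN Dept_Table d ON e.dept_id = d.dept_id
WHERE e.dept_id = 10
  AND e.emp_name LIKE 'A%';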

Step 4: The execution engine takes the compiler's output and executes it on the Hadoop platform. It involves the following tasks:

A MapReduce task first serializes its part of the plan into a plan.xml file.

The plan.xml file is then added to the job cache for the task, and instances of ExecMapper and ExecReducer are spawned by Hadoop.

Each of these classes deserializes the plan.xml file and executes the relevant part of the task.

The final results are stored in a temporary location. At the completion of the entire query, they are moved into the target table (or partitions) if the query was an INSERT; otherwise they are returned to the calling program from that temporary location.
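A small sketch of the difference (emp_name and emp_backup are hypothetical names):

-- Plain SELECT: results are written to a scratch directory on HDFS and
-- fetched back to the client from there
SELECT emp_name FROM Emp_Table;

-- INSERT: the temporary output is moved into the target table's HDFS
-- location once the query finishes
INSERT OVERWRITE TABLE emp_backup
SELECT * FROM Emp_Table;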

Note: All the tasks are executed in the order of their dependencies; each task runs only after all of its prerequisites have completed.

And to learn about the metastore tables and their fields, you can have a look at the ER diagram of the metastore:

(image: ER diagram of the Hive metastore schema)
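If you want to inspect it directly, here is a rough sketch of a query against the metastore's backing database (MySQL, Postgres, Derby, etc.); treat the table and column names as approximate, since the schema varies slightly between Hive versions:

-- Run this in the RDBMS that backs the metastore, not in Hive.
-- TBLS holds the table entry, SDS its storage descriptor (HDFS location,
-- input/output formats), SERDE_PARAMS serde properties such as the field
-- delimiter; column names/types live in COLUMNS_V2 and partition
-- locations in PARTITIONS, each linked back through SDS.
SELECT t.TBL_NAME,
       s.LOCATION,          -- HDFS directory of the table
       s.INPUT_FORMAT,
       sp.PARAM_KEY,        -- e.g. field.delim
       sp.PARAM_VALUE
FROM TBLS t
JOIN SDS s ON t.SD_ID = s.SD_ID
LEFT JOIN SERDE_PARAMS sp ON s.SERDE_ID = sp.SERDE_ID
WHERE t.TBL_NAME = 'emp_table';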

HTH

0 votes

To see the underlying HDFS directory path, delimiters, partitions and other details:

describe extended Emp_Table;
describe extended Dept_Table;

To see the Hive control flow, put EXPLAIN or EXPLAIN EXTENDED in front of your query.
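For example (the join columns are hypothetical); the EXTENDED form also prints details such as the HDFS paths and serde properties it pulled from the metastore:

EXPLAIN EXTENDED
SELECT e.emp_name, d.dept_name
FROM Emp_Table e
JOIN Dept_Table d ON e.dept_id = d.dept_id;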