
I am running a local instance of Spark 2.4.0.

I want to execute a SQL query against Hive.

Before, with Spark 1.x, I used HiveContext for this:

import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
val hivequery = hc.sql("show databases")

But now I see that HiveContext is deprecated: https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/hive/HiveContext.html. Looking at the HiveContext.sql() code, it is now simply a wrapper over SparkSession.sql(). The recommendation is to use enableHiveSupport on the SparkSession builder, but as this question clarifies, that only affects the metastore and the list of tables; it does not change the execution engine.
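For reference, the Spark 2.x equivalent of the HiveContext snippet above would look roughly like this (a minimal sketch; the app name is arbitrary):

```scala
import org.apache.spark.sql.SparkSession

// enableHiveSupport() wires up the Hive metastore and Hive SerDes,
// replacing the old HiveContext. It requires Hive classes on the classpath.
val spark = SparkSession.builder()
  .appName("hive-example") // arbitrary name
  .enableHiveSupport()
  .getOrCreate()

// sql() returns a DataFrame, just as HiveContext.sql() did
val databases = spark.sql("show databases")
databases.show()
```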

So the questions are:

  1. How can I tell whether my query is running on the Hive engine or on the Spark engine?
  2. How can I control this?

1 Answer


From my understanding, there is no standalone "Hive engine" that runs your query here. You submit a query to Hive, and Hive executes it on one of several engines:

  • Spark
  • Tez (a DAG engine that generalizes MapReduce)
  • MapReduce (the classic Hadoop engine)

If you use Spark directly, your query is executed by Spark using Spark SQL (available since roughly Spark 1.5.x, if I recall correctly).

Which engine Hive uses depends on its configuration; I remember seeing a Hive-on-Spark setup in the Cloudera distribution. In that case Hive uses Spark to execute the job matching your query (instead of MapReduce or Tez), but Hive still parses and analyzes the query itself.

With a local Spark instance, you will only ever use the Spark engine (Spark SQL / Catalyst), but you can run it with Hive support enabled. That means you can read an existing Hive metastore and interact with its tables.

This requires a Spark installation with Hive support: the Hive dependencies and hive-site.xml on your classpath.
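To address question 1 at runtime: one way to confirm which catalog your session was built with is the `spark.sql.catalogImplementation` setting, and `explain()` shows that the physical plan is a Spark (Catalyst) plan, not a Hive one. A sketch, assuming a running SparkSession named `spark`:

```scala
// "hive"      -> enableHiveSupport() was used and Hive classes were found
// "in-memory" -> plain Spark catalog, no Hive metastore
val catalogImpl = spark.conf.get("spark.sql.catalogImplementation")
println(s"catalog: $catalogImpl")

// The physical plan printed here is produced by Catalyst and executed
// by Spark, regardless of where the table metadata comes from.
spark.sql("show databases").explain()
```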