My question is rather simple, but somehow I cannot find a clear answer in the documentation.
I have Spark 2 running on a CDH 5.10 cluster, alongside Hive and its metastore.
I create a session in my Spark program as follows:
    import org.apache.spark.sql.SparkSession;
    SparkSession spark = SparkSession.builder().appName("MyApp").enableHiveSupport().getOrCreate();
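As a sanity check, I verify that Hive support is actually enabled on the session (a sketch; I am assuming the internal setting spark.sql.catalogImplementation is readable through the runtime config):

    // Sketch: check which catalog implementation the session uses.
    // With enableHiveSupport() this should print "hive"; without it, "in-memory".
    // Assumption: the internal key "spark.sql.catalogImplementation" is exposed here.
    System.out.println(spark.conf().get("spark.sql.catalogImplementation"));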
Suppose I have the following HiveQL query:
spark.sql("SELECT someColumn FROM someTable")
I would like to know whether:
- under the hood this query is translated into Hive MapReduce primitives, or
- the support for HiveQL is only syntactic, and the query is actually executed by Spark SQL under the hood.
I am doing a performance evaluation, and I don't know whether the execution times of queries run through spark.sql([hiveQL query]) should be attributed to Spark or to Hive.
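For context, this is roughly how I take the measurements (a minimal sketch; count() is used here only to force execution, since spark.sql() itself is lazy and merely builds a plan):

    // Sketch: time a query end to end by triggering an action.
    long start = System.nanoTime();
    long rows = spark.sql("SELECT someColumn FROM someTable").count();
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println(rows + " rows in " + elapsedMs + " ms");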