
I've set up an AWS EMR cluster that includes Spark 2.3.2, Hive 2.3.3, and HBase 1.4.7. How can I configure Spark to access Hive tables?

I've taken the following steps, but I get this error:

    java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: org/apache/tez/dag/api/SessionNotRunning when creating Hive client using classpath:

    Please make sure that jars for your version of hive and hadoop are included in the paths passed to spark.sql.hive.metastore.jars

Steps:

  1. cp /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf

  2. In /usr/lib/spark/conf/spark-defaults.conf added:

    spark.sql.hive.metastore.jars /usr/lib/hadoop/lib/*:/usr/lib/hive/lib/*

  3. In zeppelin I create a spark session:

    val spark = SparkSession.builder.appName("clue").enableHiveSupport().getOrCreate()
    import spark.implicits._


1 Answer


Steps 1 and 2 you mentioned are mostly fine, except for a small tweak that should help.

Since you are using Hive 2.x, set spark.sql.hive.metastore.jars to maven instead, and set spark.sql.hive.metastore.version to match your metastore version, 2.3.3. It is sufficient to use just 2.3 as the version; you can see why in the Apache Spark code.

Here is a sample of the working configuration I set in spark-defaults.conf (note that spark-defaults.conf does not support inline # comments, so the comments below are on their own lines):

# An arbitrary timeout in seconds that you can change
spark.sql.broadcastTimeout  600
spark.sql.catalogImplementation hive
spark.sql.hive.metastore.jars   maven
# No need for the minor version
spark.sql.hive.metastore.version    2.3
spark.sql.hive.thriftServer.singleSession   true
spark.sql.warehouse.dir {hdfs | s3 | etc}
hive.metastore.uris thrift://hive-host:9083
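If editing spark-defaults.conf is not an option (for example, for per-notebook settings in Zeppelin), the same configuration can be sketched programmatically. Note that the metastore settings must be supplied before the first SparkSession is created; the warehouse path and thrift URI below are placeholders, not real endpoints:

```scala
// Sketch only: the same settings as above, applied via the builder.
// The warehouse dir and metastore URI are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("clue")
  .config("spark.sql.hive.metastore.jars", "maven")
  .config("spark.sql.hive.metastore.version", "2.3")
  .config("spark.sql.warehouse.dir", "s3://your-bucket/warehouse") // placeholder
  .config("hive.metastore.uris", "thrift://hive-host:9083")        // placeholder
  .enableHiveSupport()
  .getOrCreate()
```

These builder settings only take effect for the first session created in the JVM; once a SparkSession exists, getOrCreate() returns it with its original metastore configuration.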

With the previous setup, I have been able to execute queries against my data warehouse in Zeppelin as follows:

val rows = spark.sql("YOUR QUERY")
rows.show()
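Before running real queries, the metastore wiring can be sanity-checked with catalog queries; a small sketch (assumes a session named `spark` as above):

```scala
// Sketch: confirm the Hive metastore is reachable by listing what it knows.
// If these calls succeed, enableHiveSupport() is wired up correctly.
spark.sql("SHOW DATABASES").show()
spark.catalog.listTables("default").show()
```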

More details on connecting to an external Hive metastore can be found in the Databricks documentation.