Previously I could work entirely within the spark.sql API to interact with both Hive tables and Spark DataFrames, querying views registered with Spark and Hive tables through the same API.

I'd like to confirm that this is no longer possible with Hadoop 3.1 and PySpark 2.3.2. To do any operation on a Hive table, must one use the HiveWarehouseSession API rather than spark.sql? Is there any way to continue using the spark.sql API to interact with Hive, or will I have to refactor all my code?

from pyspark_llap import HiveWarehouseSession   # Hive Warehouse Connector API

hive = HiveWarehouseSession.session(spark).build()
hive.execute("arbitrary example query here")    # goes through HWC / HiveServer2
spark.sql("arbitrary example query here")       # the old route that used to cover both

It's confusing because the Spark documentation says

Connect to any data source the same way

and specifically gives Hive as an example, but then the Hortonworks Hadoop 3 documentation says

As a Spark developer, you execute queries to Hive using the JDBC-style HiveWarehouseSession API

These two statements are in direct contradiction.

The Hortonworks documentation continues: "You can use the Hive Warehouse Connector (HWC) API to access any type of table in the Hive catalog from Spark. When you use SparkSQL, standard Spark APIs access tables in the Spark catalog."
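
To make that split concrete, here is a minimal sketch of what I mean, assuming default HDP 3.x settings; the table and view names are made up:

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df.createOrReplaceTempView("my_spark_view")                   # lands in the Spark catalog
spark.sql("SELECT COUNT(*) FROM my_spark_view").show()        # works: standard Spark API
# spark.sql("SELECT COUNT(*) FROM my_hive_managed_table")     # fails: table lives in the Hive catalog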

So, at least as things stand, spark.sql is no longer universal, correct? And I can no longer seamlessly interact with Hive tables through the same API?


1 Answer


Yep, correct. I'm using Spark 2.3.2 and I can no longer access Hive tables through the default Spark SQL API. Since HDP 3.0, the catalogs for Apache Hive and Apache Spark are separate and mutually exclusive. As you mentioned, you have to use HiveWarehouseSession from the pyspark_llap library.
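
A minimal sketch of that route, assuming the Hive Warehouse Connector jar and the pyspark_llap package are on the job's classpath and the HWC settings (for example spark.sql.hive.hiveserver2.jdbc.url) are configured; the database and table names are made up:

from pyspark_llap import HiveWarehouseSession

hive = HiveWarehouseSession.session(spark).build()
hive.setDatabase("default")                                    # select a database in the Hive catalog
hive.showTables().show()                                       # list Hive tables, returned as a DataFrame
df = hive.executeQuery("SELECT * FROM my_hive_managed_table")  # runs in Hive, returns a Spark DataFrame
df.show()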