Previously I could work entirely within the spark.sql API to interact with both Hive tables and Spark DataFrames. I could query views registered with Spark or Hive tables through the same API.
I'd like to confirm: is that no longer possible with Hadoop 3.1 and PySpark 2.3.2? Does every operation on a Hive table now have to go through the HiveWarehouseSession API rather than the spark.sql API? Is there any way to keep using the spark.sql API to interact with Hive, or will I have to refactor all my code?
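from pyspark_llap import HiveWarehouseSession
# new: Hive tables through the HWC API
hive = HiveWarehouseSession.session(spark).build()
hive.execute("arbitrary example query here")
# old: what I used for both Hive tables and Spark views
spark.sql("arbitrary example query here")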
It's confusing because the Spark documentation says
"Connect to any data source the same way"
and specifically gives Hive as an example, but then the Hortonworks Hadoop 3 documentation says
"As a Spark developer, you execute queries to Hive using the JDBC-style HiveWarehouseSession API"
These two statements are in direct contradiction.
The Hadoop documentation continues "You can use the Hive Warehouse Connector (HWC) API to access any type of table in the Hive catalog from Spark. When you use SparkSQL, standard Spark APIs access tables in the Spark catalog."
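To make the split described in that passage concrete, here is a minimal sketch of the kind of routing shim it seems to imply. `CatalogRouter` and its `catalog` parameter are hypothetical names of my own, not part of Spark or HWC. It assumes `spark` is a SparkSession and `hive` is a HiveWarehouseSession that exposes `executeQuery`, as the Hortonworks documentation describes.

```python
class CatalogRouter:
    """Hypothetical helper: one entry point, two back ends."""

    def __init__(self, spark, hive):
        self._spark = spark  # assumed: a SparkSession
        self._hive = hive    # assumed: a HiveWarehouseSession built from it

    def sql(self, query, catalog="spark"):
        # Tables and views in the Spark catalog go through the standard API.
        if catalog == "spark":
            return self._spark.sql(query)
        # Tables in the Hive catalog go through the HWC session instead.
        if catalog == "hive":
            return self._hive.executeQuery(query)
        raise ValueError("unknown catalog: %r" % (catalog,))
```

Even with something like this, every call site that touches a Hive table would still need `catalog="hive"` added, so it would reduce the refactoring rather than avoid it.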
So, at least for now, spark.sql is no longer universal, correct? And I can no longer seamlessly interact with Hive tables using the same API?