
I need to read / write tables stored in remote Hive Server from Pyspark. All I know about this remote Hive is that it runs under Docker. From Hadoop Hue I have found two urls for an iris table that I try to select some data from:

I have a table metastore url:

http://xxx.yyy.net:8888/metastore/table/mytest/iris

and table location url:

hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytest.db/iris

I have no idea why the last URL contains quickstart.cloudera:8020. Maybe this is because Hive runs under Docker?

Discussing access to Hive tables, the PySpark tutorial says:

https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

When working with Hive, one must instantiate SparkSession with Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions. Users who do not have an existing Hive deployment can still enable Hive support. When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse. You may need to grant write privilege to the user who starts the Spark application.

In my case, the hive-site.xml that I managed to get has neither the hive.metastore.warehouse.dir nor the spark.sql.warehouse.dir property.

The Spark tutorial suggests using the following code to access remote Hive tables:

from os.path import expanduser, join, abspath

from pyspark.sql import SparkSession
from pyspark.sql import Row

# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('spark-warehouse')

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

And in my case, after running code similar to the above, but with the correct value for warehouse_location, I think I can then do:

spark.sql("use mytest")
spark.sql("SELECT * FROM iris").show()

So where can I find the remote Hive warehouse location? How can I make PySpark work with remote Hive tables?

Update

hive-site.xml has the following properties:

...
...
...
 <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
...
...
...
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://127.0.0.1:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>

So it looks like 127.0.0.1 is the Docker localhost of the container running the Cloudera Docker app. That does not help me get to the Hive warehouse at all.

How can I access the Hive warehouse when Cloudera Hive runs as a Docker app?

Hi, did you get the answer to this question? – Sudhir Belagali

1 Answer


Here https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_hive_metastore_configure.html under "Remote Mode" you'll find that the Hive metastore runs in its own JVM process; other processes such as HiveServer2, HCatalog, and Cloudera Impala communicate with it through the Thrift API, using the hive.metastore.uris property in hive-site.xml:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://xxx.yyy.net:8888</value>
</property>

(Not sure about the way you have to specify the address)
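
If editing hive-site.xml is not convenient, the same URI can also be passed to the SparkSession builder from PySpark. A minimal sketch, assuming the Docker container publishes the Thrift metastore port (9083 in the question's hive-site.xml) on the host xxx.yyy.net; the exact host and port depend on the container's port mapping:

from pyspark.sql import SparkSession

# Point Spark at the remote Hive metastore directly.
# The host and port below are assumptions: use whatever the Docker
# container actually exposes for the Thrift metastore service.
spark = SparkSession \
    .builder \
    .appName("Remote Hive metastore example") \
    .config("hive.metastore.uris", "thrift://xxx.yyy.net:9083") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()
spark.sql("SELECT * FROM mytest.iris").show()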

And maybe this property too:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://xxx.yyy.net/hive</value>
</property>