Pyspark: selecting data in remote Hive Server

Question

Trying to read and write data stored in remote Hive Server from Pyspark. I follow this example:

from os.path import expanduser, join, abspath

from pyspark.sql import SparkSession
from pyspark.sql import Row

# warehouse_location points to the default location for managed databases and tables
warehouse_location = 'hdfs://quickstart.cloudera:8020/user/hive/warehouse'

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

Example shows how to create a new table in warehouse:

# spark is an existing SparkSession
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

# Queries are expressed in HiveQL
spark.sql("SELECT * FROM src").show()

Yet, I need to access existing tabe iris that was created in mytest.db, so table location is

table_path = warehouse_location + '/mytest.db/iris`

How to select from existing table?

Update

I have metastore url:

http://test.mysite.net:8888/metastore/table/mytest/iris

and table location url:

hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytest.db/iris

when using hdfs://quickstart.cloudera:8020/user/hive/warehouse as a warehouse location in the code above and trying:

spark.sql("use mytest")

I get exception:

    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Database 'mytest' not found;"

What would be correct url to select from iris?

MaFF MaFF · Accepted Answer · 2017-09-04T13:19:58

You can call the table directly using

spark.sql("SELECT * FROM mytest.iris")

Or specify the database you want to work with with

spark.sql("use mytest")
spark.sql("SELECT * FROM iris)

Pyspark: selecting data in remote Hive Server

1 Answers