I cannot configure Spark SQL so that I can access a Hive table created through Spark Thrift Server (natively from Spark, without using JDBC).
I use a single configuration file, `conf/hive-site.xml`, for both Spark Thrift Server and Spark SQL. I have the `javax.jdo.option.ConnectionURL` property set to `jdbc:derby:;databaseName=/home/user/spark-2.4.0-bin-hadoop2.7/metastore_db;create=true`, and the `spark.sql.warehouse.dir` property set to an absolute path pointing to the `spark-warehouse` directory. I start the Thrift Server with `./start-thriftserver.sh` and can observe that the embedded Derby database is created, along with the `metastore_db` directory. I can connect with `beeline`, create a table, and see the `spark-warehouse` directory created with a subdirectory for the table. So at this stage everything is fine.
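For reference, the relevant part of my `conf/hive-site.xml` looks roughly like this (the warehouse path below is just an illustration of the absolute path I use):

```xml
<configuration>
  <!-- Embedded Derby metastore, created on first use -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/home/user/spark-2.4.0-bin-hadoop2.7/metastore_db;create=true</value>
  </property>
  <!-- Absolute path to the warehouse directory -->
  <property>
    <name>spark.sql.warehouse.dir</name>
    <value>/home/user/spark-2.4.0-bin-hadoop2.7/spark-warehouse</value>
  </property>
</configuration>
```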
Then I launch the pyspark shell with Hive support enabled, `./bin/pyspark --conf spark.sql.catalogImplementation=hive`, and try to access the Hive table with:
```python
from pyspark.sql import HiveContext

# sc is the SparkContext that the pyspark shell provides
hc = HiveContext(sc)
hc.sql('show tables')
```
I get errors like:
```
ERROR XJ040: Failed to start database '/home/user/spark-2.4.0-bin-hadoop2.7/metastore_db' with class loader sun.misc.Launcher$AppClassLoader@1b4fb997
ERROR XSDB6: Another instance of Derby may have already booted the database /home/user/spark-2.4.0-bin-hadoop2.7/metastore_db
pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
```
Apparently the pyspark shell is trying to boot the same embedded Derby database that the Thrift Server already holds open, instead of sharing the metastore I configured; as I understand it, embedded Derby only allows a single JVM to open a database at a time, which would explain the XSDB6 error. If I stop the Thrift Server and run only the pyspark shell, everything works fine. How can I fix this?
Is an embedded Derby metastore database workable for having both the Thrift Server and Spark access one Hive metastore, or do I need to use something like MySQL? I don't have a cluster and do everything locally.
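In case it matters for the answer: my understanding is that switching to an external metastore would mean replacing the Derby connection properties with something roughly like the following (the host, database name, user, and password are hypothetical values for a local MySQL instance):

```xml
<configuration>
  <!-- Hypothetical network-accessible metastore instead of embedded Derby -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>
</configuration>
```

But I'd prefer to keep the setup local and simple if the embedded Derby metastore can somehow be shared.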