0
votes

I cannot configure Spark SQL so that I could access Hive Table in Spark Thrift Server (without using JDBC, but natively from Spark)

I use single configuration file conf/hive-site.xml for both Spark Thrift Server and Spark SQL. I have javax.jdo.option.ConnectionURL property set to jdbc:derby:;databaseName=/home/user/spark-2.4.0-bin-hadoop2.7/metastore_db;create=true. I also set spark.sql.warehouse.dir property to absolute path pointing to spark-warehouse directory. I run Thrift server with ./start-thriftserver.sh and I can observe that embedded Derby database is being created with metastore_db directory. I can connect with beeline, create a table and see spark-warehouse directory created with subdirectory for table. So at this stage it's fine.

I launch pyspark shell with Hive support enabled ./bin/pyspark --conf spark.sql.catalogImplementation=hive, and try to access the Hive table with:

from pyspark.sql import HiveContext
hc = HiveContext(sc)
hc.sql('show tables')

I got errors like:

ERROR XJ040: Failed to start database '/home/user/spark-2.4.0-bin-hadoop2.7/metastore_db' with class loader sun.misc.Launcher$AppClassLoader@1b4fb997

ERROR XSDB6: Another instance of Derby may have already booted the database /home/user/spark-2.4.0-bin-hadoop2.7/metastore_db

pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;

Apparently Spark is trying to create new Derby database instead of using Metastore I put in config file. If I stop Thrift Server and run only spark, everything is fine. How could I fix it?

Is Embedded Derby Metastore Database fine to have both Thrift Server and Spark access one Hive or I need to use e.g. MySQL? I don't have a cluster and do everything locally.

1

1 Answers

0
votes

Embedded Derby Metastore Database is fine to be used in local, but for production environment, it is recommended to use any other Metastore database. Yes, you can definitely use MYSQL as metastore. For this, you have to make an entry in hive-site.xml.

You can follow the configuration guide at Use MySQL for the Hive Metastore for the exact details.