
I am new to Hadoop/Spark/Hive!

I have created a single-node Linux (Ubuntu 18.04.1 LTS) VM running locally with the following: Hadoop 3.1.0, Spark 2.3.1, and Hive 3.0.0.

My Hive is using the standard embedded Derby DB, and I can access Hive through the terminal, create databases and tables, and query those tables fine. My metastore_db is located at ~/hivemetastore/metastore_db.

I have also created the following HDFS directories:

hadoop fs -mkdir -p /user/hive/warehouse

hadoop fs -mkdir -p /tmp/hive
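
(For completeness: the usual Hive getting-started steps also make these directories group-writable; if that step was skipped, the commands would be:)

hadoop fs -chmod g+w /user/hive/warehouse

hadoop fs -chmod g+w /tmp/hive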

(Note: I do not have a hive-site.xml file under either $HIVE_HOME/conf or $SPARK_HOME/conf.)

However, when I try to read a Hive table from PySpark (via the terminal), I get an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/session.py", line 710, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/local/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/local/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'

The code I am using to access Hive from PySpark is:

from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql('show databases').show()
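
For reference, one commonly suggested workaround is to point the session at the existing Derby metastore explicitly, instead of letting Spark create a fresh metastore_db in whatever directory pyspark was launched from. A minimal sketch, using the metastore path from this setup and the standard spark.hadoop. prefix for passing Hive/Hadoop properties (the exact keys should be verified for your versions):

import os
from pyspark.sql import SparkSession

# Point the Hive client at the existing Derby metastore rather than
# letting Spark create a new metastore_db in the launch directory.
metastore_path = os.path.expanduser("~/hivemetastore/metastore_db")
spark = (SparkSession.builder
         .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
         .config("spark.hadoop.javax.jdo.option.ConnectionURL",
                 "jdbc:derby:;databaseName={};create=true".format(metastore_path))
         .enableHiveSupport()
         .getOrCreate())
spark.sql('show databases').show()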

1 Answer


Did you start the metastore?

Type:

hive --service metastore
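
By default the standalone metastore service listens on thrift port 9083. Once it is running, PySpark can be pointed at it explicitly rather than relying on a hive-site.xml. A minimal sketch, assuming the service is local and on the default port:

from pyspark.sql import SparkSession

# Connect to the standalone metastore started with 'hive --service metastore';
# thrift://localhost:9083 is the default address for a local service.
spark = (SparkSession.builder
         .config("hive.metastore.uris", "thrift://localhost:9083")
         .enableHiveSupport()
         .getOrCreate())
spark.sql('show databases').show()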

If the metastore is set up but a previous session crashed, embedded Derby may have left its lock files behind (Derby allows only one active connection at a time). Remove the lock using rm metastore_db/*.lck, or restart the system (or the PySpark shell).
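
In this setup the metastore is at ~/hivemetastore, so the concrete command would be:

rm ~/hivemetastore/metastore_db/*.lck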