I'm trying to persist a DataFrame with Spark SQL using a HiveContext, and I'm seeing the following errors when I submit my job to a local standalone Spark server:
15/11/18 15:49:52 WARN Hive: Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
... 16 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
... 22 more
Caused by: java.lang.NullPointerException
at org.apache.thrift.transport.TSocket.open(TSocket.java:170)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:236)
at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
... 27 more
I'm running Spark built with -Phive -Phive-thriftserver against Hadoop 2.4.0. I have a standalone Hive metastore running in EC2 that I can connect to locally from its host. It is backed by Postgres and, as far as I can tell, is set up correctly. I copied the hive-site.xml into my local Spark install's conf directory.
This is my spark-submit:
./bin/spark-submit --class etl.MainExample --master spark://localhost:7077 --driver-class-path libs/postgresql-9.4-1203.jdbc41.jar sparkETL/target/spark.etl-1.0-SNAPSHOT-jar-with-dependencies.jar
My Scala code basically does this:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Build a StructType schema from a comma-separated list of field names
val schemaDef2 = "some fields......"
val dataSchema = StructType(schemaDef2.split(",").map(fieldName => StructField(fieldName, StringType, nullable = false)))
val hc = new HiveContext(sc)
// results is an RDD[Row]
val newDF = hc.createDataFrame(results, dataSchema)
newDF.repartition(1).write.format("parquet").mode(SaveMode.Overwrite).saveAsTable("MyTable")
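Since the NullPointerException comes out of TSocket.open, my guess is that the HiveContext never sees a usable thrift URI for the metastore. This is a minimal check I could add before the write, assuming hive.metastore.uris is the relevant setting; the host and port below are placeholders, not my actual values:

// Set/inspect the metastore URI on the HiveContext directly,
// instead of relying on hive-site.xml being picked up from conf/.
hc.setConf("hive.metastore.uris", "thrift://my-metastore-host:9083") // placeholder host/port
println(hc.getConf("hive.metastore.uris")) // confirm what the context actually sees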
I'm able to save this to a local Parquet file as well as a text/CSV file, but I want the table registered in the Hive metastore, with the data eventually stored in S3.
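For context, this is roughly where I want to end up once the metastore connection works; the bucket and path are placeholders, and I'm assuming that passing a path option to saveAsTable registers an unmanaged table pointing at that location:

// Write the Parquet data to S3 but still register the table in the Hive metastore.
// Assumes S3 credentials are already configured; bucket/path are placeholders.
newDF.repartition(1)
  .write
  .format("parquet")
  .option("path", "s3n://my-bucket/warehouse/mytable")
  .mode(SaveMode.Overwrite)
  .saveAsTable("MyTable")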
Am I missing a jar or some other option in spark-submit? I'm completely stuck at this point.