I'm trying to persist a DataFrame with Spark SQL using a HiveContext, and I'm seeing the following errors when I submit my job to a local standalone Spark server:
15/11/18 15:49:52 WARN Hive: Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
... 16 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
... 22 more
Caused by: java.lang.NullPointerException
at org.apache.thrift.transport.TSocket.open(TSocket.java:170)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:236)
at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
... 27 more
I'm running Spark built with -Phive -Phive-thriftserver against Hadoop 2.4.0. I have a standalone Hive metastore running in EC2 that I can connect to locally from its host. It is backed by Postgres and, as far as I can tell, is set up correctly. I copied the hive-site.xml into my local Spark install's conf directory.
This is my spark-submit:
./bin/spark-submit --class etl.MainExample --master spark://localhost:7077 --driver-class-path libs/postgresql-9.4-1203.jdbc41.jar sparkETL/target/spark.etl-1.0-SNAPSHOT-jar-with-dependencies.jar
My Scala code basically does this:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Build a StructType schema from a comma-separated list of field names
val schemaDef2 = "some fields......"
val dataSchema = StructType(schemaDef2.split(",").map(fieldName => StructField(fieldName, StringType, nullable = false)))
val hc = new HiveContext(sc)
// results is an RDD[Row]
val newDF = hc.createDataFrame(results, dataSchema)
newDF.repartition(1).write.format("parquet").mode(SaveMode.Overwrite).saveAsTable("MyTable")
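Since the NullPointerException comes out of TSocket.open, my guess is that the HiveContext never sees a usable thrift URI for the metastore. This is a minimal check I could add before the write, assuming hive.metastore.uris is the relevant setting; the host and port below are placeholders, not my actual values:

// Set/inspect the metastore URI on the HiveContext directly,
// instead of relying on hive-site.xml being picked up from conf/.
hc.setConf("hive.metastore.uris", "thrift://my-metastore-host:9083") // placeholder host/port
println(hc.getConf("hive.metastore.uris")) // confirm what the context actually sees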
I'm able to save this to a local Parquet file as well as a text/CSV file, but I want the table registered in the Hive metastore, with the data eventually stored in S3.
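For context, this is roughly where I want to end up once the metastore connection works; the bucket and path are placeholders, and I'm assuming that passing a path option to saveAsTable registers an unmanaged table pointing at that location:

// Write the Parquet data to S3 but still register the table in the Hive metastore.
// Assumes S3 credentials are already configured; bucket/path are placeholders.
newDF.repartition(1)
  .write
  .format("parquet")
  .option("path", "s3n://my-bucket/warehouse/mytable")
  .mode(SaveMode.Overwrite)
  .saveAsTable("MyTable")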
Am I missing a jar or some other option in spark-submit? I'm completely stuck at this point.