13
votes

This is a bit of an edge case when saving a Parquet table with partitioning in Spark SQL:

// schema definition
final StructType schema = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("time", DataTypes.StringType, true),
    DataTypes.createStructField("accountId", DataTypes.StringType, true),
    ...
));

DataFrame df = hiveContext.read().schema(schema).json(stringJavaRDD);

df.coalesce(1)
    .write()
    .mode(SaveMode.Append)
    .format("parquet")
    .partitionBy("year")
    .saveAsTable("tblclick8partitioned");

Spark warns:

Persisting partitioned data source relation into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive

In Hive:

hive> describe tblclick8partitioned;
OK
col                     array<string>           from deserializer
Time taken: 0.04 seconds, Fetched: 1 row(s)

Obviously the schema is not correct; however, if I use saveAsTable in Spark SQL without partitioning, the table can be queried without problems.

The question is: how can I make a Parquet table created in Spark SQL, including its partition info, compatible with Hive?

2
Data is stored in HDFS and metadata is stored in the Hive metastore. – dunlu_98k
Did you try to "register as temp table", then run the SQL commands "CREATE TABLE" and "INSERT <with dynamic partitioning syntax>"? (sketched below) – Samson Scharfrichter
Thank you Samson, not yet, but isn't saveAsTable supposed to do this already? – dunlu_98k
Maybe it depends on which version of Spark you are using, e.g. "it's not a bug, it's a feature" vs. "will be implemented someday". – Samson Scharfrichter
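
A rough sketch of Samson's suggestion, using the asker's existing hiveContext and df; the staging-table name click_staging and the column list are made up for illustration:

// Register the DataFrame so it can be queried from SQL.
df.registerTempTable("click_staging");

// Allow partition values to be taken from the SELECT (dynamic partitioning).
hiveContext.sql("SET hive.exec.dynamic.partition=true");
hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict");

hiveContext.sql(
    "CREATE TABLE IF NOT EXISTS tblclick8partitioned "
  + "(time STRING, accountId STRING) "
  + "PARTITIONED BY (year INT) STORED AS PARQUET");

// The partition column must come last in the SELECT list.
hiveContext.sql(
    "INSERT INTO TABLE tblclick8partitioned PARTITION (year) "
  + "SELECT time, accountId, year FROM click_staging");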

2 Answers

11
votes

That's because DataFrame.saveAsTable creates RDD partitions but not Hive partitions. The workaround is to create the table via HQL before calling DataFrame.saveAsTable. An example from SPARK-14927 looks like this:

hc.sql("create external table tmp.partitiontest1(val string) partitioned by (year int)")

Seq(2012 -> "a", 2013 -> "b", 2014 -> "c").toDF("year", "val")
  .write
  .partitionBy("year")
  .mode(SaveMode.Append)
  .saveAsTable("tmp.partitiontest1")

1
vote

A solution is to create the table in Hive first and then save the data with ...partitionBy("year").insertInto("default.mytable").

In my experience, creating the table in Hive and then using ...partitionBy("year").saveAsTable("default.mytable") did not work. This is with Spark 1.6.2.
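
In the asker's Java API, that approach would look roughly like the following sketch; the column list is assumed, and note that insertInto resolves columns by position rather than by name, so the DataFrame's column order must match the table definition:

// Create the partitioned table in Hive first (the column list here is
// illustrative), then insert into it. insertInto matches columns by
// position, not by name, so the DataFrame column order must line up.
hiveContext.sql(
    "CREATE TABLE IF NOT EXISTS default.mytable "
  + "(time STRING, accountId STRING) "
  + "PARTITIONED BY (year INT) STORED AS PARQUET");

df.write()
    .mode(SaveMode.Append)
    .partitionBy("year")
    .insertInto("default.mytable");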