0 votes

I am trying to save a DataFrame as an external table which will be queried both with Spark and possibly with Hive, but somehow I cannot query or see any data with Hive. It works in Spark.

Here is how to reproduce the problem:

scala> println(spark.conf.get("spark.sql.catalogImplementation"))
hive
scala> spark.conf.set("hive.exec.dynamic.partition", "true")
scala> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
scala> spark.conf.set("spark.sql.sources.bucketing.enabled", true)
scala> spark.conf.set("hive.exec.dynamic.partition", "true")
scala> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
scala> spark.conf.set("hive.enforce.bucketing","true")
scala> spark.conf.set("optimize.sort.dynamic.partitionining","true")
scala> spark.conf.set("hive.vectorized.execution.enabled","true")
scala> spark.conf.set("hive.enforce.sorting","true")
scala> spark.conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
scala> spark.conf.set("hive.metastore.uris", "thrift://localhost:9083")
scala> var df = spark.range(20).withColumn("random", round(rand()*90))
df: org.apache.spark.sql.DataFrame = [id: bigint, random: double]

scala> df.head
res19: org.apache.spark.sql.Row = [0,46.0]                                      
scala> df.repartition(10, col("random")).write.mode("overwrite").option("compression", "snappy").option("path", "s3a://company-bucket/dev/hive_confs/").format("orc").bucketBy(10, "random").sortBy("random").saveAsTable("hive_random")
19/08/01 19:26:55 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`hive_random` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. 

Here is how I query it in Hive:

Beeline version 2.3.4-amzn-2 by Apache Hive
0: jdbc:hive2://localhost:10000/default> select * from hive_random;
+------------------+
| hive_random.col  |
+------------------+
+------------------+
No rows selected (0.213 seconds)

But it works fine in Spark:

scala> spark.sql("SELECT * FROM hive_random").show
+---+------+                                                                    
| id|random|
+---+------+
|  3|  13.0|
| 15|  13.0|
...
|  8|  46.0|
|  9|  65.0|
+---+------+
This column name is strange: hive_random.col - gorros
Spark can query it because it just uses the Hive metastore, but Hive uses MapReduce, Tez or whatever under the hood. You can try to set the Hive engine to Spark and test. - gorros
@gorros how do I set the hive engine? - Rafael Barros
I have added an answer below. - gorros
I would suggest creating the table using DDL first and then running df.repartition(10, col("random")).write.mode("overwrite").option("compression", "snappy").option("path", "s3a://company-bucket/dev/hive_confs/").format("orc").bucketBy(10, "random").sortBy("random"), omitting saveAsTable. - VenkateswaraCh

3 Answers

1 vote

There is a warning after your saveAsTable call. That's where the hint lies:

'Persisting bucketed data source table default.hive_random into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.'

The reason is that saveAsTable creates RDD partitions but not Hive partitions; the workaround is to create the table via HQL before calling DataFrame.saveAsTable.
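
A minimal sketch of that workaround, assuming the table name, schema and S3 path from the question (the exact DDL is an illustration, not from the original post). Note it finishes with insertInto rather than saveAsTable, since an overwriting saveAsTable would replace the pre-created definition; depending on the Spark version and the hive.enforce.bucketing setting, Spark may still refuse to populate a Hive-bucketed table:

// df and col are from the question's spark-shell session
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS hive_random (id BIGINT, random DOUBLE)
  CLUSTERED BY (random) SORTED BY (random) INTO 10 BUCKETS
  STORED AS ORC
  LOCATION 's3a://company-bucket/dev/hive_confs/'
""")

// Write into the pre-created table. insertInto keeps the Hive definition,
// whereas saveAsTable would re-register the table in Spark's own format.
df.repartition(10, col("random")).write.mode("overwrite").insertInto("hive_random")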

0 votes

I would suggest trying a couple of things. First, try to set the Hive execution engine to Spark:

set hive.execution.engine=spark;

Second, try to create the external table in the metastore first and then save the data to that table, as sketched below.
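
A rough beeline sketch of that second suggestion; the schema, bucketing clauses and S3 path mirror the question and are assumptions, not tested DDL:

set hive.execution.engine=spark;

CREATE EXTERNAL TABLE IF NOT EXISTS hive_random (id BIGINT, random DOUBLE)
CLUSTERED BY (random) SORTED BY (random) INTO 10 BUCKETS
STORED AS ORC
LOCATION 's3a://company-bucket/dev/hive_confs/';

With the table created from Hive's side, beeline should see the real id and random columns rather than the single hive_random.col placeholder.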

0 votes

The semantics of bucketed tables in Spark and Hive are different.
The doc has details of the differences in semantics. It states that:

Data is written to bucketed tables but the output does not adhere with expected bucketing spec. This leads to incorrect results when one tries to consume the Spark written bucketed table from Hive.

Workaround: if reading from both engines is the requirement, writes need to happen from Hive, as sketched below.
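
For illustration, a hedged sketch of such a Hive-side write in beeline; hive_random_staging is a hypothetical non-bucketed source table (e.g. plain ORC files written by Spark) and is not from the original post:

set hive.enforce.bucketing=true;

-- hive_random is assumed to be a Hive-created bucketed table;
-- hive_random_staging is a hypothetical non-bucketed source
INSERT OVERWRITE TABLE hive_random
SELECT id, random FROM hive_random_staging;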