
I have a complex, nested Hive external table created on top of HDFS (the files are in Avro format). When I run a Hive query, it shows all records and partitions.

However when I use the same table in Spark:

val df = spark
  .read
  .format("avro")
  .option("avroSchema", Schema.toString)  // option() must be set before load()
  .load("avro_files")

It does not show the partition column.

But when I use spark.sql("select * from hive_External_Table"), it works and I can see the partition column in the resulting DataFrame. The problem is that I cannot manually pass my schema that way.

Please note that when I looked at the data, the partition column is not part of the underlying saved files, but I can see it when I query the table through Hive. I can also see the partition column when I load the Avro files using PySpark:

df = (sqlContext.read
      .format("com.databricks.spark.avro")
      .option("avroSchema", pegIndivSchema)
      .load('avro_files'))

So I was wondering why this happens?
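For background on why the partition column is absent from the file contents: Hive-style partitioning encodes partition values in the directory names rather than inside the data files, which is why Hive (and Spark's partition discovery) can surface a column that never appears in the Avro payload. A minimal sketch of how such values are recovered from a path (the path and column name here are hypothetical, not from the question):

```python
# Hive-style layout encodes partition columns in directory names, e.g.
#   avro_files/load_date=2021-06-01/part-0000.avro
# The partition value lives only in the path, never in the file itself.
def partition_values(path):
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

print(partition_values("avro_files/load_date=2021-06-01/part-0000.avro"))
# -> {'load_date': '2021-06-01'}
```

This is also why loading the files directly with a hand-written Avro schema can drop the column: the schema describes the file contents, and the partition column was never written there.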


1 Answer


Check the columns present in the Schema.toString value that you pass as the avroSchema option — the partition column is most likely missing from it. Also try using the same schema you used in the PySpark code:

option("avroSchema", pegIndivSchema)
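If the partition column is indeed missing, one option is to append it to the Avro schema JSON before passing it in. A hedged sketch, assuming an Avro record schema held as JSON and a hypothetical partition column name (`load_date`); whether Spark then populates the column for your table depends on how the files are laid out:

```python
import json

# Hypothetical base schema; field and column names are assumptions,
# not taken from the question.
base_schema = {
    "type": "record",
    "name": "indiv",
    "fields": [{"name": "id", "type": "long"}],
}

def with_partition_column(schema, col_name):
    # Deep-copy via JSON round-trip so the original schema is untouched,
    # then append the partition column as a nullable string field.
    patched = json.loads(json.dumps(schema))
    patched["fields"].append({"name": col_name, "type": ["null", "string"]})
    return patched

patched = with_partition_column(base_schema, "load_date")
print(json.dumps(patched))
```

The resulting JSON string could then be supplied as the avroSchema option in place of Schema.toString.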