I have a Hive table partitioned by country. I want to load data for specific partitions into my DataFrame, as shown below:
df=spark.read.orc("/apps/hive/warehouse/emp.db/partition_load_table").where('country="NCL"' && 'county="RUS"')
It's giving me an error, though I was able to load a single partition.
Below is my directory structure in HDFS:
/apps/hive/warehouse/emp.db/partition_load_table/country=NCL
df=spark.read.orc("/apps/hive/warehouse/emp.db/partition_load_table").where('country="NCL"')
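For context, two things look off in the failing line: `&&` is not valid Python (PySpark uses `&`/`|` on Column expressions, or `AND`/`OR` inside a single SQL string), and a row's `country` can never equal both `"NCL"` and `"RUS"` at once, so the two equality tests need OR semantics, not AND (also note the `county` vs `country` spelling). A minimal plain-Python sketch of that filter logic, with hypothetical partition values and no Spark required:

```python
# Hypothetical partition values, standing in for the country= directories.
rows = [{"country": "NCL"}, {"country": "RUS"}, {"country": "USA"}]

# AND of two equality tests on the same column can never match a row:
both = [r for r in rows if r["country"] == "NCL" and r["country"] == "RUS"]

# OR semantics (an isin-style membership test) selects both wanted partitions:
either = [r for r in rows if r["country"] in ("NCL", "RUS")]
```

In PySpark the equivalent would be something like `df.where(col("country").isin("NCL", "RUS"))` or a single SQL string, `df.where('country = "NCL" OR country = "RUS"')` (assuming `from pyspark.sql.functions import col`).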