1 vote

I need help finding the partition column names for a Hive table using PySpark. The table might have multiple partition columns, and preferably the output should return a list of the partition columns for the Hive table.

It would be great if the result also included the datatypes of the partition columns.
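For example, for a table partitioned by year, month, and day, the desired output would look something like this (the column names here are just for illustration):

[('year', 'string'), ('month', 'string'), ('day', 'string')]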

Any suggestions will be helpful.


3 Answers

4 votes

It can be done using desc as shown below:

>>> df = spark.sql("desc test_dev_db.partition_date_table")
>>> df.show(truncate=False)
+-----------------------+---------+-------+
|col_name               |data_type|comment|
+-----------------------+---------+-------+
|emp_id                 |int      |null   |
|emp_name               |string   |null   |
|emp_salary             |int      |null   |
|emp_date               |date     |null   |
|year                   |string   |null   |
|month                  |string   |null   |
|day                    |string   |null   |
|# Partition Information|         |       |
|# col_name             |data_type|comment|
|year                   |string   |null   |
|month                  |string   |null   |
|day                    |string   |null   |
+-----------------------+---------+-------+

Since this table is partitioned, you can see the partition column information here along with the datatypes.

It seems you are interested in just the partition column names and their respective data types, so I am creating a list of tuples.

partition_list = df.select(df.col_name, df.data_type).rdd.map(lambda x: (x[0], x[1])).collect()

>>> print(partition_list)
[(u'emp_id', u'int'), (u'emp_name', u'string'), (u'emp_salary', u'int'), (u'emp_date', u'date'), (u'year', u'string'), (u'month', u'string'), (u'day', u'string'), (u'# Partition Information', u''), (u'# col_name', u'data_type'), (u'year', u'string'), (u'month', u'string'), (u'day', u'string')]

partition_details = [partition_list[index+1:] for index, item in enumerate(partition_list) if item[0] == '# col_name']

>>> print(partition_details)
[[(u'year', u'string'), (u'month', u'string'), (u'day', u'string')]]
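If you only need a flat list of the partition column names, the nested result can be flattened, for example:

partition_names = [name for group in partition_details for name, _ in group]
# ['year', 'month', 'day']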

It will return an empty list in case the table is not partitioned.
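Putting the steps above together, a minimal reusable helper might look like this (get_partition_columns is a hypothetical name; it assumes you pass in an active SparkSession):

def get_partition_columns(spark, table_name):
    """Return (column_name, data_type) tuples for the table's partition columns."""
    rows = spark.sql("desc {0}".format(table_name)).collect()
    partition_cols = []
    in_partition_section = False
    for row in rows:
        # Partition columns are listed after the '# col_name' marker row.
        if row.col_name == "# col_name":
            in_partition_section = True
            continue
        if in_partition_section and row.col_name and not row.col_name.startswith("#"):
            partition_cols.append((row.col_name, row.data_type))
    return partition_cols

>>> get_partition_columns(spark, "test_dev_db.partition_date_table")
[('year', 'string'), ('month', 'string'), ('day', 'string')]

Hope this helps.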

0 votes

Another simple method, using a PySpark script:

from pyspark.sql import functions as f

# "describe formatted" returns rows of (col_name, data_type, comment).
# In this environment the table name appears in the row labelled "Name"
# and the partition columns in rows labelled "Part 0", "Part 1", ...
# (adjust the "Part N" values to the partition columns in your table).
df = spark.sql("describe formatted serve.cust_transactions")
df2 = df.where(f.col("col_name").isin("Name", "Part 0", "Part 2")).select(f.col("data_type"))

# Transpose into a single row for display.
df3 = df2.toPandas().transpose()
display(df3)

The result is a single transposed row containing the table name and the data types of the selected partition columns.
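If display is not available (it is a notebook helper, e.g. on Databricks), a minimal variation under the same assumptions collects the values into a plain Python list instead:

values = [row.data_type for row in df2.collect()]
print(values)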

0 votes

The following snippet

  1. Gets the columns for the given table
  2. Keeps only the partition columns
  3. Extracts (name, datatype) tuples from the partition columns
# s: pyspark.sql.session.SparkSession
# table: str

# 1. Get the columns for the given table
columns = s.catalog.listColumns(table)

# 2. Keep only the partition columns
partition_columns = [c for c in columns if c.isPartition]

# 3. Extract the name and dataType (among other attributes)
[(c.name, c.dataType) for c in partition_columns]
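For example, assuming an active SparkSession named spark and the example table from the first answer:

columns = spark.catalog.listColumns("partition_date_table", "test_dev_db")
partition_columns = [c for c in columns if c.isPartition]
print([(c.name, c.dataType) for c in partition_columns])
# e.g. [('year', 'string'), ('month', 'string'), ('day', 'string')]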