
I am learning Spark. I have a DataFrame ts with the structure below.

ts.show()
+--------------------+--------------------+
|    UTC|                 PST|
+--------------------+--------------------+
|2020-11-04 02:24:...|2020-11-03 18:24:...|
+--------------------+--------------------+

I need to insert ts into a partitioned Hive table with the structure below:

spark.sql(""" create table db.ts_part
(
UTC timestamp,
PST timestamp
)
PARTITIONED BY(  bkup_dt DATE )
STORED AS ORC""")

How do I dynamically pass the system run date in the insert statement so that the table gets partitioned on bkup_dt by date?

I tried the code below, but it didn't work:

ts.write.partitionBy(current_date()).insertInto("db.ts_part",overwrite=False)

How should I do it? Can someone please help!

What did not work? What is the unexpected behavior or the error code? – Steven
I got the error: TypeError: Column is not iterable – Harikrishnan Balachandran

1 Answer


Try creating a new column with current_date() and then writing into the partitioned Hive table. (Your original attempt failed because partitionBy() expects column names as strings; current_date() returns a Column, hence TypeError: Column is not iterable.)

Example:

from pyspark.sql.functions import current_date

# bkup_dt is added as the last column, matching the table's partition column;
# insertInto() maps columns by position and can't be combined with partitionBy().
df.\
withColumn("bkup_dt", current_date()).\
write.\
insertInto("db.ts_part", overwrite=False)

UPDATE:

Alternatively, create a temp view and then run an insert statement:

df.createOrReplaceTempView("tmp")

spark.sql("insert into table <table_name> partition (bkup_dt) select *, current_date() as bkup_dt from tmp")