
Could you guide me on replacing old data with new data in a specific Hive partition using PySpark (DataFrame)?

I have a Hive table partitioned by county.

Every month I receive records for some counties, and I want to replace the old data in those partitions with the new records.

I have developed a PySpark script and loaded the new data for a particular partition into a DataFrame. Now I would like to replace the old data with the new data for that partition alone (since the base table is big).

I tried dropping that partition and then appending the new DataFrame, but that does not seem to work (the partition ends up with both old and new data). Using "overwrite" mode drops the whole table.

Is there any way in Spark to replace the old data with the new data (new_dataframe) for a particular partition alone?

Here is the code I have tried:

    # drop the old partition (quoting the value, assuming a string partition column)
    sqlContext.sql("ALTER TABLE db.tbl DROP PARTITION (partition='" + part1 + "')")
    # tried both "append" (keeps the old data) and "overwrite" (drops the whole table)
    new_dataframe.write.mode("append").format("parquet").partitionBy("partition").saveAsTable('db.tbl')

Could you suggest an approach in Spark?

I have googled and seen suggestions such as deleting the partition files in HDFS, which I do not see as a good approach. The other suggestion was to run a Hive query, INSERT OVERWRITE on that partition, but that is on the Hive side.
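For what it is worth, that Hive statement can also be issued through Spark's SQL interface. A minimal sketch, assuming the partition column is named county and that new_data, col1, and col2 are placeholders for your own view name and non-partition columns:

    # expose the new data to SQL (use registerTempTable on Spark 1.x)
    new_dataframe.createOrReplaceTempView('new_data')

    # overwrite only the target partition; other partitions are left untouched
    sqlContext.sql(
        "INSERT OVERWRITE TABLE db.tbl PARTITION (county='" + part1 + "') "
        "SELECT col1, col2 FROM new_data"
    )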

Is there any similar functionality on the Spark DataFrame side?

Thank you.

Facing the same issue, any suggestions? – Maviles

1 Answer


The following works well for me (CDH, Spark 2.1.0, Python 2.7):

>>> df.write.saveAsTable('<table_name>', mode='overwrite', partitionBy='county')

The default format for the files being written is Parquet for me, so I omitted the format option.
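As a side note for readers on Spark 2.3 or later: dynamic partition overwrite makes overwrite mode replace only the partitions present in the DataFrame instead of the whole table. A minimal sketch, where df and db.tbl stand in for your own DataFrame and table:

    # Spark 2.3+: overwrite only the partitions present in the DataFrame
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    # insertInto matches columns by position, so the DataFrame's column
    # order must match the table's (partition column last)
    df.write.mode("overwrite").insertInto("db.tbl")

Without that setting (or on older versions), overwrite on an existing partitioned table replaces the entire table, which matches the behaviour described in the question.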