Could you guide me on how to replace old data with new data in a specific Hive partition using PySpark (DataFrames)?
I have a Hive table partitioned by county. Every month I receive records for some counties, and I want to replace the old data in those partitions with the new data.
I have developed a PySpark script and loaded the new data for a particular partition into a DataFrame. Now I would like to replace the old data with the new data for that partition alone (since the base table is big).
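For reference, this is roughly how the new data ends up in a DataFrame (the source path, format, and county value below are only placeholders for illustration, not my actual script):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext()
    sqlContext = HiveContext(sc)

    # Placeholder source path and filter value, for illustration only
    new_dataframe = (
        sqlContext.read
        .parquet("/data/incoming/current_month/")
        .filter("county = 'CountyA'")  # keep only the partition being refreshed
    )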
I tried dropping that partition and then appending the new DataFrame, but that does not seem to work (the partition ends up with both the old and the new data). Using "overwrite" mode drops the whole table.
Is there any way in Spark to replace the old data with the new data (new_dataframe) for a particular partition alone?
Here is the code I have tried:
    # Drop the existing partition first
    sqlContext.sql('alter table db.tbl drop partition(partition=' + part1 + ')')

    # Then write the new data; I tried both mode("append") and mode("overwrite")
    new_dataframe.write.mode("append").format("parquet").partitionBy("partition").saveAsTable('db.tbl')
Could you please suggest an approach in Spark?
I have searched around and seen suggestions such as deleting the partition directory in HDFS, which I do not think is a good approach. The other suggestion was to run a Hive query, INSERT OVERWRITE on that partition, but that is on the Hive side (see the sketch below).
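From PySpark I could presumably issue that same statement through sqlContext.sql, something like the sketch below (the column names are placeholders, and I am not sure this is the recommended Spark approach):

    # Illustration only: col1/col2 stand in for my actual columns
    new_dataframe.registerTempTable("new_data")
    sqlContext.sql(
        "insert overwrite table db.tbl partition(partition=" + part1 + ") "
        "select col1, col2 from new_data"
    )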
Is there any similar functionality in Spark?
Thank you.