
Could you guide me on replacing old data with new data in a specific Hive partition using PySpark (DataFrame)?

I have a Hive table partitioned by county.

Every month I receive records for some counties, and I want to replace the old data in those partitions with the new records.

I have developed a PySpark script and loaded the new data for a particular partition into a DataFrame. Now I would like to replace the old data with the new data for that partition alone (since the base table is big).

I tried dropping that partition and then appending the new DataFrame, but that does not seem to work (the partition ends up with both old and new data). Using "overwrite" mode drops the whole table.

Is there any way in Spark to replace the old data with the new data (new_dataframe) for a particular partition alone?

Here is the code I have tried:

    # drop the old partition (quoting the value, assuming a string partition column)
    sqlContext.sql("ALTER TABLE db.tbl DROP PARTITION (partition='" + part1 + "')")
    # tried both "append" (keeps the old data) and "overwrite" (drops the whole table)
    new_dataframe.write.mode("append").format("parquet").partitionBy("partition").saveAsTable('db.tbl')

Could you suggest an approach in Spark?

I have googled and seen suggestions such as deleting the partition files in HDFS, which I do not see as a good approach. The other suggestion was to run a Hive query, INSERT OVERWRITE on that partition, but that is on the Hive side.
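For what it is worth, that Hive statement can also be issued through Spark's SQL interface. A minimal sketch, assuming the partition column is named county and that new_data, col1, and col2 are placeholders for your own view name and non-partition columns:

    # expose the new data to SQL (use registerTempTable on Spark 1.x)
    new_dataframe.createOrReplaceTempView('new_data')

    # overwrite only the target partition; other partitions are left untouched
    sqlContext.sql(
        "INSERT OVERWRITE TABLE db.tbl PARTITION (county='" + part1 + "') "
        "SELECT col1, col2 FROM new_data"
    )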

Is there any similar functionality on the Spark DataFrame side?

Thank you.

Facing the same issue, any suggestions? – Maviles

1 Answer


The following works well for me (CDH, Spark 2.1.0, Python 2.7):

>>> df.write.saveAsTable('<table_name>', mode='overwrite', partitionBy='county')

The default format for the files being written is Parquet for me, so I omitted the format option.
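As a side note for readers on Spark 2.3 or later: dynamic partition overwrite makes overwrite mode replace only the partitions present in the DataFrame instead of the whole table. A minimal sketch, where df and db.tbl stand in for your own DataFrame and table:

    # Spark 2.3+: overwrite only the partitions present in the DataFrame
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    # insertInto matches columns by position, so the DataFrame's column
    # order must match the table's (partition column last)
    df.write.mode("overwrite").insertInto("db.tbl")

Without that setting (or on older versions), overwrite on an existing partitioned table replaces the entire table, which matches the behaviour described in the question.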