
In my job, the final step is to store the processed data in a Hive table partitioned on the "date" column. Sometimes, due to a job failure, I need to re-run the job for a particular partition alone. As observed, when I use the code below, Spark overwrites all the partitions in overwrite mode.

ds.write.partitionBy("date").mode("overwrite").saveAsTable("test.someTable")

After going through multiple blogs and Stack Overflow posts, I followed the steps below to overwrite only particular partitions.

Step 1: Enable dynamic partition overwrite mode
spark.conf.set("spark.sql.sources.partitionOverWriteMode", "dynamic")

Step 2: Write the DataFrame to the Hive table using saveAsTable

Seq(("Company1", "A"), 
("Company2","B"))
.toDF("company", "id")
.write
.mode(SaveMode.Overwrite)
.partitionBy("id")
.saveAsTable(targetTable)

spark.sql(s"SELECT * FROM ${targetTable}").show(false)
spark.sql(s"show partitions ${targetTable}").show(false)

Seq(("CompanyA3", "A"))
.toDF("company", "id")
.write
.mode(SaveMode.Overwrite)
.insertInto(targetTable)

spark.sql(s"SELECT * FROM ${targetTable}").show(false)
spark.sql(s"show partitions ${targetTable}").show(false)

Still, it overwrites all the partitions.

As per this blog, https://www.waitingforcode.com/apache-spark-sql/apache-spark-sql-hive-insertinto-command/read, insertInto should overwrite only the matching partitions.
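(Side note: insertInto resolves columns by position, not by name, so the DataFrame's column order has to match the table schema, with the partition column last. A minimal sketch, reusing targetTable from above; the misordered DataFrame is a made-up example:)

// insertInto is positional: if the columns arrive in the wrong order,
// reorder them so the partition column "id" lines up last.
val misordered = Seq(("A", "CompanyA3")).toDF("id", "company")
misordered
  .select("company", "id")
  .write
  .mode(SaveMode.Overwrite)
  .insertInto(targetTable)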

If I create the table first and then use the insertInto method, it works fine.

Set the required configuration.

Step 1: Create the table

Step 2: Add data using the insertInto method

Step 3: Overwrite the partition
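The screenshots for these steps are not available, so here is a minimal sketch of what they presumably contained; the exact DDL (column types, STORED AS PARQUET) is an assumption:

// Set the required configuration for dynamic partition overwrite
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// Step 1: create the partitioned Hive table manually (assumed DDL)
spark.sql(s"""
  CREATE TABLE IF NOT EXISTS ${targetTable} (company STRING)
  PARTITIONED BY (id STRING)
  STORED AS PARQUET
""")

// Step 2: add data using insertInto
Seq(("Company1", "A"), ("Company2", "B"))
  .toDF("company", "id")
  .write
  .mode(SaveMode.Overwrite)
  .insertInto(targetTable)

// Step 3: overwrite a single partition; only id=A is replaced
Seq(("CompanyA3", "A"))
  .toDF("company", "id")
  .write
  .mode(SaveMode.Overwrite)
  .insertInto(targetTable)

spark.sql(s"show partitions ${targetTable}").show(false)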

I wanted to know: what is the difference between creating the Hive table via saveAsTable and creating the table manually? Why does it not work in the first scenario? Could anyone help me with this?

Worked some time ago, will need to investigate, will get back. - thebluephantom
Interesting in that some time ago my deleted answer worked. Not sure what to make of that. - thebluephantom
If you don't mind, could you please share the deleted answer again? - user4164776
I got it to work now. After dinner I am looking at what the difference is with your code. Things have changed. Are you running in a notebook, the spark shell, or a compiled program? - thebluephantom
Got it, just a typo that fooled us. - thebluephantom

1 Answer


Try with lowercase w!

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

not

spark.conf.set("spark.sql.sources.partitionOverWriteMode", "dynamic")

It fooled me. If you look closely, you have both variants in use in your scripts.
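A minimal sketch with the corrected key, reusing the example data from the question:

// Note the lowercase "w": partitionOverwriteMode, not partitionOverWriteMode
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

Seq(("CompanyA3", "A"))
  .toDF("company", "id")
  .write
  .mode(SaveMode.Overwrite)
  .insertInto(targetTable)  // now only partition id=A is overwritten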

It appears my original answer is deprecated.