
In my job, the final step is to store the processed data in a Hive table partitioned on the "date" column. Sometimes, due to a job failure, I need to re-run the job for a particular partition alone. As observed, when I use the code below, Spark overwrites all the partitions in overwrite mode.

ds.write.partitionBy("date").mode("overwrite").saveAsTable("test.someTable")

After going through multiple blogs and Stack Overflow posts, I followed the steps below to overwrite only particular partitions.

Step 1: Enable dynamic partition overwrite mode
spark.conf.set("spark.sql.sources.partitionOverWriteMode", "dynamic")

Step 2: Write the DataFrame to the Hive table using saveAsTable

Seq(("Company1", "A"), 
("Company2","B"))
.toDF("company", "id")
.write
.mode(SaveMode.Overwrite)
.partitionBy("id")
.saveAsTable(targetTable)

spark.sql(s"SELECT * FROM ${targetTable}").show(false)
spark.sql(s"show partitions ${targetTable}").show(false)

Seq(("CompanyA3", "A"))
.toDF("company", "id")
.write
.mode(SaveMode.Overwrite)
.insertInto(targetTable)

spark.sql(s"SELECT * FROM ${targetTable}").show(false)
spark.sql(s"show partitions ${targetTable}").show(false)

Still, it overwrites all the partitions.

As per this blog, https://www.waitingforcode.com/apache-spark-sql/apache-spark-sql-hive-insertinto-command/read, insertInto should overwrite only the matching partitions.
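(Side note: insertInto resolves columns by position, not by name, so the DataFrame's column order has to match the table schema, with the partition column last. A minimal sketch, reusing targetTable from above; the misordered DataFrame is a made-up example:)

// insertInto is positional: if the columns arrive in the wrong order,
// reorder them so the partition column "id" lines up last.
val misordered = Seq(("A", "CompanyA3")).toDF("id", "company")
misordered
  .select("company", "id")
  .write
  .mode(SaveMode.Overwrite)
  .insertInto(targetTable)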

If I create the table first and then use the insertInto method, it works fine.

Set the required configuration.

Step 1: Create the table

Step 2: Add data using the insertInto method

Step 3: Overwrite the partition
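The screenshots for these steps are not available, so here is a minimal sketch of what they presumably contained; the exact DDL (column types, STORED AS PARQUET) is an assumption:

// Set the required configuration for dynamic partition overwrite
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// Step 1: create the partitioned Hive table manually (assumed DDL)
spark.sql(s"""
  CREATE TABLE IF NOT EXISTS ${targetTable} (company STRING)
  PARTITIONED BY (id STRING)
  STORED AS PARQUET
""")

// Step 2: add data using insertInto
Seq(("Company1", "A"), ("Company2", "B"))
  .toDF("company", "id")
  .write
  .mode(SaveMode.Overwrite)
  .insertInto(targetTable)

// Step 3: overwrite a single partition; only id=A is replaced
Seq(("CompanyA3", "A"))
  .toDF("company", "id")
  .write
  .mode(SaveMode.Overwrite)
  .insertInto(targetTable)

spark.sql(s"show partitions ${targetTable}").show(false)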

I wanted to know: what is the difference between creating the Hive table via saveAsTable and creating the table manually? Why does it not work in the first scenario? Could anyone help me with this?

Worked some time ago, will need to investigate, will get back. - thebluephantom
Interesting in that some time ago my deleted answer worked. Not sure what to make of that. - thebluephantom
If you don't mind, could you please share the deleted answer again? - user4164776
I got it to work now. After dinner I am looking at what the difference is with your code. Things have changed. Are you running in a notebook, the spark shell, or a compiled program? - thebluephantom
Got it, just a typo that fooled us. - thebluephantom

1 Answer


Try with lowercase w!

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

not

spark.conf.set("spark.sql.sources.partitionOverWriteMode", "dynamic")

It fooled me. If you look closely, you have both variants in use in your scripts.
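A minimal sketch with the corrected key, reusing the example data from the question:

// Note the lowercase "w": partitionOverwriteMode, not partitionOverWriteMode
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

Seq(("CompanyA3", "A"))
  .toDF("company", "id")
  .write
  .mode(SaveMode.Overwrite)
  .insertInto(targetTable)  // now only partition id=A is overwritten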

It appears my original answer is deprecated.