2 votes

The command to append a Spark dataframe directly to a Hive table is:

df.write().mode("append").saveAsTable("tableName")

But does append mode make sure duplicate rows are avoided? For example:

  • if row A is in the Hive table and also in the Spark dataframe,
  • will appending the Spark dataframe to the Hive table result in two rows of A?

Is there a way to make sure duplication doesn't happen while appending?

Edit: There are two ways to go:

  • one mentioned by Shu: load the Hive table as a Spark dataframe, merge the two dataframes, drop duplicates and write back to the Hive table with mode 'overwrite'.
  • second: load the Hive table into a temp table, append the dataframe to the temp table, get distinct rows and overwrite the temp table back to the Hive table.

What I am looking for is: is there a way to do all of this directly, without the intermediate step of writing data to a temp table or dataframe?

Thank you.


2 Answers

2 votes

Appending the Spark dataframe to the Hive table will result in two rows of A.

To make sure duplication doesn't happen while appending:

  1. You need to load the entire Hive table into another dataframe (df1) before appending the data into the table.

  2. Then union the two dataframes (df, df1) and apply .dropDuplicates() (or) window functions to pick the required record from the duplicated records.

    Example:

    union_df = df.union(df1).dropDuplicates()
    # (or) use a window function to pick the required record

  3. Then write the data into a new table (union_df.write().mode("append").saveAsTable("newtableName")), then finally select the data from the new table and overwrite the existing Hive table.

    Example:

    spark.sql("insert overwrite table <database>.<existing_table> select * from <database>.<new_table>")

  4. Finally, drop the new table: spark.sql("drop table <database>.<new_table>")

This way you end up with no duplicated data in the Hive table.

-1 votes

None of what you ask is possible with df.write, sorry. That is, you will potentially get two rows.