So the command to append a Spark DataFrame directly to a Hive table is:
df.write().mode("append").saveAsTable("tableName")
But does append mode make sure duplicate rows are avoided? For example:
- if row A is already in the Hive table and is also in the Spark DataFrame,
- will appending the Spark DataFrame to the Hive table result in two rows of A?
Is there a way to make sure duplication doesn't happen while appending?
Edit: There are two ways to go (see the sketches after this list):
- one, mentioned by Shu: load the Hive table as a Spark DataFrame, merge the two DataFrames, drop duplicates, and write back to the Hive table with mode 'overwrite';
- two: load the Hive table into a temp table, append the DataFrame to the temp table, take the distinct rows, and overwrite the original Hive table from the temp table.
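Here is a minimal sketch of the first approach in Scala, assuming a SparkSession named spark with Hive support enabled, a DataFrame df holding the new rows, and a Hive table called "tableName" (all of these names are placeholders):

    import org.apache.spark.sql.SaveMode

    // Load the existing Hive table as a DataFrame.
    val existing = spark.table("tableName")

    // Merge the two DataFrames (schemas must match) and drop exact duplicate rows.
    val merged = existing.union(df).dropDuplicates()

    // Caveat: Spark may refuse to overwrite a table it is still reading from
    // ("Cannot overwrite a table that is also being read from"), so the merged
    // result may need to be checkpointed or staged in another table first.
    merged.write.mode(SaveMode.Overwrite).saveAsTable("tableName")

And a rough sketch of the second approach, where "temp_table" is just a hypothetical staging table name:

    import org.apache.spark.sql.SaveMode

    // Copy the current Hive table into a temp/staging table.
    spark.table("tableName").write.mode(SaveMode.Overwrite).saveAsTable("temp_table")

    // Append the new DataFrame to the staging table.
    df.write.mode(SaveMode.Append).saveAsTable("temp_table")

    // Keep only the distinct rows and overwrite the original Hive table with them.
    spark.table("temp_table").distinct().write.mode(SaveMode.Overwrite).saveAsTable("tableName")

    // Drop the staging table afterwards.
    spark.sql("DROP TABLE IF EXISTS temp_table")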
What I am looking for is a way to do all of this directly, without the intermediate step of writing the data to some temp table or DataFrame.
Thank you.