
In Hive, how can I delete duplicate records? Below is my case.

First, I load data from the products table into products_rcfileformat. There are 25 rows in the products table.

FROM products INSERT OVERWRITE TABLE products_rcfileformat         
SELECT *;

Second, I load data from the products table into products_rcfileformat again. The products table still has 25 rows, but this time I'm NOT using the OVERWRITE clause.

FROM products INSERT INTO TABLE products_rcfileformat         
SELECT *;

When I query the data it gives me a total of 50 rows, which is correct.
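For reference, a minimal way to check the row count:

-- After both inserts this returns 50 (25 original rows + 25 appended rows)
SELECT COUNT(*) FROM products_rcfileformat;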


Checking from HDFS, it seems HDFS makes another copy of the file, xxx_copy_1, instead of appending to 000000_0.


Now I want to remove those records that were read from xxx_copy_1. How can I achieve this with a Hive command? If I'm not mistaken, I can remove the xxx_copy_1 file using the hdfs dfs -rm command, followed by rerunning the insert overwrite command. But I want to know whether this can be done with a Hive command, for example a DELETE statement?
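For what it's worth, that manual workaround can be driven from the Hive CLI itself, which accepts dfs commands. A minimal sketch follows; the warehouse path is an assumption based on Hive's default location, and either step on its own removes the duplicate rows:

-- Assumed default warehouse path; substitute the table's actual LOCATION
dfs -rm /user/hive/warehouse/products_rcfileformat/xxx_copy_1;

-- Alternatively, rerunning the INSERT OVERWRITE replaces all of the table's files and reloads the 25 rows
FROM products INSERT OVERWRITE TABLE products_rcfileformat
SELECT *;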


2 Answers

1 vote

Partition your data such that the rows you want to drop (identify them with the window function row_number) are in a partition unto themselves. You can then drop that partition without impacting the rest of your table. This is a fairly sustainable model, even if your dataset grows quite large.

More detail about partitioning:

www.tutorialspoint.com/hive/hive_partitioning.htm
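A rough HiveQL sketch of that idea, assuming Hive 0.11+ for window functions; the staging table products_dedup and the columns id and name are hypothetical placeholders, since the real schema of products_rcfileformat isn't shown:

-- Hypothetical partitioned copy of the table; dup_flag separates duplicates from the rows to keep
CREATE TABLE products_dedup (id INT, name STRING)
PARTITIONED BY (dup_flag INT);

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The first row per id gets dup_flag = 0; any further copies land in partition dup_flag = 1
INSERT OVERWRITE TABLE products_dedup PARTITION (dup_flag)
SELECT id, name,
       CASE WHEN ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) = 1
            THEN 0 ELSE 1 END AS dup_flag
FROM products_rcfileformat;

-- Drop the duplicate partition without touching the rest of the table
ALTER TABLE products_dedup DROP PARTITION (dup_flag = 1);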

0 votes

Checking from HDFS, it seems HDFS makes another copy of the file, xxx_copy_1, instead of appending to 000000_0.

The reason is that files in HDFS are write-once: they cannot be edited in place. The Hive warehouse files (or wherever the table's location happens to be) live in HDFS, so Hive has to create a second file instead of appending to the existing one.

Now I want to remove those records that were read from xxx_copy_1. How can I achieve this with a Hive command?

Please check this post - Removing DUPLICATE rows in hive based on columns.

Let me know if you are satisfied with the answer there. I have another method, which removes duplicate entries, but perhaps not in the way you want.
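For completeness, one common way to drop exact-duplicate rows in Hive (not necessarily the method this answer or the linked post has in mind) is to overwrite the table with a de-duplicated copy of itself:

-- Rewrites the table with only distinct rows; assumes the duplicates are exact row copies
INSERT OVERWRITE TABLE products_rcfileformat
SELECT DISTINCT * FROM products_rcfileformat;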