
In Hive, how can I delete duplicate records? Below is my case.

First, I load data from the products table into products_rcfileformat. There are 25 rows in the products table.

FROM products INSERT OVERWRITE TABLE products_rcfileformat         
SELECT *;

Second, I load data from the products table into products_rcfileformat again. The products table still has 25 rows, but this time I'm NOT using the OVERWRITE clause.

FROM products INSERT INTO TABLE products_rcfileformat         
SELECT *;

When I query the data it gives me a total of 50 rows, which is correct.
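For reference, a minimal way to check the row count:

-- After both inserts this returns 50 (25 original rows + 25 appended rows)
SELECT COUNT(*) FROM products_rcfileformat;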


Checking from HDFS, it seems HDFS makes another copy of the file, xxx_copy_1, instead of appending to 000000_0.


Now I want to remove those records that were read from xxx_copy_1. How can I achieve this with a Hive command? If I'm not mistaken, I can remove the xxx_copy_1 file using the hdfs dfs -rm command, followed by rerunning the insert overwrite command. But I want to know whether this can be done with a Hive command, for example a DELETE statement?
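For what it's worth, that manual workaround can be driven from the Hive CLI itself, which accepts dfs commands. A minimal sketch follows; the warehouse path is an assumption based on Hive's default location, and either step on its own removes the duplicate rows:

-- Assumed default warehouse path; substitute the table's actual LOCATION
dfs -rm /user/hive/warehouse/products_rcfileformat/xxx_copy_1;

-- Alternatively, rerunning the INSERT OVERWRITE replaces all of the table's files and reloads the 25 rows
FROM products INSERT OVERWRITE TABLE products_rcfileformat
SELECT *;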


2 Answers

1 vote

Partition your data such that the rows you want to drop (identify them with the window function row_number) are in a partition unto themselves. You can then drop that partition without impacting the rest of your table. This is a fairly sustainable model, even if your dataset grows quite large.

More detail about partitioning:

www.tutorialspoint.com/hive/hive_partitioning.htm
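A rough HiveQL sketch of that idea, assuming Hive 0.11+ for window functions; the staging table products_dedup and the columns id and name are hypothetical placeholders, since the real schema of products_rcfileformat isn't shown:

-- Hypothetical partitioned copy of the table; dup_flag separates duplicates from the rows to keep
CREATE TABLE products_dedup (id INT, name STRING)
PARTITIONED BY (dup_flag INT);

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The first row per id gets dup_flag = 0; any further copies land in partition dup_flag = 1
INSERT OVERWRITE TABLE products_dedup PARTITION (dup_flag)
SELECT id, name,
       CASE WHEN ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) = 1
            THEN 0 ELSE 1 END AS dup_flag
FROM products_rcfileformat;

-- Drop the duplicate partition without touching the rest of the table
ALTER TABLE products_dedup DROP PARTITION (dup_flag = 1);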

0 votes

Checking from HDFS, it seems HDFS makes another copy of the file, xxx_copy_1, instead of appending to 000000_0.

The reason is that files in HDFS are write-once: they cannot be edited in place. The Hive warehouse files (or wherever the table's location happens to be) live in HDFS, so Hive has to create a second file instead of appending to the existing one.

Now I want to remove those records that were read from xxx_copy_1. How can I achieve this with a Hive command?

Please check this post - Removing DUPLICATE rows in hive based on columns.

Let me know if you are satisfied with the answer there. I have another method, which removes duplicate entries, but perhaps not in the way you want.
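For completeness, one common way to drop exact-duplicate rows in Hive (not necessarily the method this answer or the linked post has in mind) is to overwrite the table with a de-duplicated copy of itself:

-- Rewrites the table with only distinct rows; assumes the duplicates are exact row copies
INSERT OVERWRITE TABLE products_rcfileformat
SELECT DISTINCT * FROM products_rcfileformat;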