In hive, how can I delete duplicate records ? Below is my case,
First, I load data from product table to products_rcfileformat. There are 25 rows of records on product table
FROM products INSERT OVERWRITE TABLE products_rcfileformat
SELECT *;
Second, I load data from product table to products_rcfileformat. There are 25 rows of records on product table. But this time I'm NOT using OVERWRITE clause
FROM products INSERT INTO TABLE products_rcfileformat
SELECT *;
When I query the data it give me total rows = 50 which are right
Check from hdfs, it seem hdfs make another copy of file xxx_copy_1 instead of append to 000000_0
Now I want to remove those records that read from xxx_copy_1. How can I achieve this in hive command ? If I'm not mistaken, i can remove xxx_copy_1 file by using hdfs dfs -rm command follow by rerun insert overwrite command. But I want to know whether this can it be done by using hive command example like delete statement?