3 votes

I need to delete duplicate records from a table. The table contains 33 columns, of which only PK_NUM is the primary key column. Since PK_NUM is unique for every row, we can keep either the min or the max PK_NUM value for each group of duplicates.

  1. Total records in the table: 1,766,799,022
  2. Distinct records in the table: 69,237,983
  3. Duplicate records in the table: 1,697,561,039

Column details:

  • 4: DATE data type
  • 4: NUMBER data type
  • 1: CHAR data type
  • 24: VARCHAR2 data type

Size of table: 386 GB

DB details: Oracle Database 11g EE, 11.2.0.2.0, 64-bit Production

Sample data:

  • col1, col2, col3
  • 1, ABC, 123
  • 2, PQR, 456
  • 3, ABC, 123

The expected result should contain only 2 records:

  • col1, col2, col3
  • 1, ABC, 123
  • 2, PQR, 456

* Row 1 can be replaced by row 3, and vice versa (either row may survive).

My plan here is to:

  1. Pull the distinct records and store them in a backup table (i.e., by using INSERT INTO ... SELECT), as sketched below.
  2. Truncate the existing table and move the records from the backup table back into the original.
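
A minimal sketch of that plan, assuming a hypothetical table BIG_TABLE and the three sample columns (the real statements would list all 33 columns and group by the 15 columns that define a duplicate):

-- Backup table created beforehand with the same structure, no rows:
create table big_table_bak as
select * from big_table where 1 = 0;

-- Step 1: copy one row per duplicate group into the backup
insert /*+ append */ into big_table_bak
select min(col1), col2, col3
from   big_table
group  by col2, col3;
commit;   -- required after a direct-path (APPEND) insert

-- Step 2: empty the original and move the deduplicated rows back
truncate table big_table;
insert /*+ append */ into big_table
select * from big_table_bak;
commit;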

Since the data size is huge:

  1. I want to know the most optimized SQL for retrieving the distinct records.
  2. Is there any estimate of how long the INSERT INTO ... SELECT will take to complete, and how long it will take to truncate the existing table?

Please let me know if there is any better way to achieve this. My ultimate goal is to remove the duplicates.

From my experience, the insert will be a lot quicker than the delete, so your approach sounds good. You can make it even quicker if you drop the original table, rename the backup to the old name, and re-create all constraints. Then you save the work of re-inserting the rows. – a_horse_with_no_name

How many columns do you need to compare for distinctness? – David Aldridge

15 columns are needed to compare for distinctness. – Avin_add

How much memory does the machine have available, or what is the maximum available size for the PGA? – David Aldridge

Instead of an insert of the distinct rows, go for a CTAS (which will be without indexes and therefore faster too). – SriniV

2 Answers

2 votes

One option for making this memory-efficient is to insert (nologging, append) all of the rows into a table that is hash partitioned on the list of columns on which duplicates are to be detected, or, if there is a limitation on the number of columns, then on as many as you can use (aiming for those with maximum selectivity). Use something like 1,024 partitions, and each one will ideally be around 386 MB (386 GB / 1,024 partitions).
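
A minimal sketch of that staging step, assuming a hypothetical source table BIG_TABLE and using the sample columns as stand-ins for the real partition key (Oracle limits a partition key to 16 columns):

-- Hash-partitioned staging table; the partition key should be the
-- columns that define a duplicate
create table temp_table
nologging
partition by hash (col2, col3)
partitions 1024
as
select * from big_table;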

You have then isolated all of the potential duplicates for each row into the same partition, and standard methods for deduplication will run on each partition without as much memory consumption.

So for each partition you can do something like ...

insert /*+ append */ into new_table
select *
from   temp_table partition (p1) t1
where  not exists (
         select null
         from   temp_table partition (p1) t2
         where  t1.col1 = t2.col1 and
                t1.col2 = t2.col2 and
                t1.col3 = t2.col3 and
                ... etc ...
                t1.rowid < t2.rowid);  -- keep the duplicate with the highest rowid
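
Since that statement has to be repeated for each of the partitions, a hypothetical PL/SQL driver loop over the system-generated partition names (a sketch, assuming the staging table is called TEMP_TABLE and only three comparison columns) might look like this:

begin
  for p in (select partition_name
            from   user_tab_partitions
            where  table_name = 'TEMP_TABLE'
            order  by partition_position)
  loop
    execute immediate
         'insert /*+ append */ into new_table '
      || 'select * from temp_table partition (' || p.partition_name || ') t1 '
      || 'where not exists ('
      || '  select null from temp_table partition (' || p.partition_name || ') t2 '
      || '  where t1.col1 = t2.col1 and t1.col2 = t2.col2 and t1.col3 = t2.col3 '
      || '  and t1.rowid < t2.rowid)';
    commit;  -- a direct-path insert must be committed before the table is touched again
  end loop;
end;
/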

The key to good performance here is that the hash table created to perform the anti-join in that query, which is going to be nearly as big as the partition itself, must be able to fit in memory. So if you can manage a 2 GB sort area you need at least 386/2 ≈ 193 table partitions. Round up to the nearest power of two, so make it 256 table partitions in that case.
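
If you prefer to pin the work area size rather than rely on automatic PGA management, a hypothetical session setup (the values are purely illustrative) could be:

alter session set workarea_size_policy = manual;
alter session set hash_area_size = 2000000000;  -- ~2 GB for the hash anti-join
alter session set sort_area_size = 2000000000;  -- in case a sort is chosen instead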

1 vote

Try this:

rename table_name to table_name_dup;

and then:

create table table_name
as
select
  min(col1) as col1   -- an expression needs a column alias in CTAS
, col2
, col3
from table_name_dup
group by
  col2
, col3;
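
Given the 386 GB of data, a variant worth considering (a sketch, not a tested recommendation; the degree of parallelism is a placeholder) is to run the CTAS with NOLOGGING and PARALLEL to cut redo generation and use direct-path parallelism:

create table table_name
nologging
parallel 8
as
select
  min(col1) as col1
, col2
, col3
from table_name_dup
group by
  col2
, col3;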

As far as I know, not much temp tablespace is used, as the whole GROUP BY takes place in the target tablespace where the new table will be created. Once finished, you can just drop the table with the duplicates:

drop table table_name_dup;
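
Note that CTAS copies only the data (plus NOT NULL constraints), so the primary key and any other indexes or grants from the old table have to be re-created on the new one. For example, assuming the question's PK_NUM column and a hypothetical constraint name:

alter table table_name
  add constraint table_name_pk primary key (pk_num);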