3 votes

I need to delete duplicate records from a table. The table contains 33 columns, of which only PK_NUM is the primary key column. Since PK_NUM is unique for every row, we can keep either the min or the max PK_NUM value for each group of duplicates.

  1. Total records in the table: 1,766,799,022
  2. Distinct records in the table: 69,237,983
  3. Duplicate records in the table: 1,697,561,039

Column details:

  • 4: DATE data type
  • 4: NUMBER data type
  • 1: CHAR data type
  • 24: VARCHAR2 data type

Size of table: 386 GB

DB details: Oracle Database 11g EE, 11.2.0.2.0, 64-bit Production

Sample data:

  • col1, col2, col3
  • 1, ABC, 123
  • 2, PQR, 456
  • 3, ABC, 123

The expected result should contain only 2 records:

  • col1, col2, col3
  • 1, ABC, 123
  • 2, PQR, 456

* Row 1 can be replaced by row 3, and vice versa (either row may survive).

My plan here is to:

  1. Pull the distinct records and store them in a backup table (i.e., by using INSERT INTO ... SELECT), as sketched below.
  2. Truncate the existing table and move the records from the backup table back into the original.
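
A minimal sketch of that plan, assuming a hypothetical table BIG_TABLE and the three sample columns (the real statements would list all 33 columns and group by the 15 columns that define a duplicate):

-- Backup table created beforehand with the same structure, no rows:
create table big_table_bak as
select * from big_table where 1 = 0;

-- Step 1: copy one row per duplicate group into the backup
insert /*+ append */ into big_table_bak
select min(col1), col2, col3
from   big_table
group  by col2, col3;
commit;   -- required after a direct-path (APPEND) insert

-- Step 2: empty the original and move the deduplicated rows back
truncate table big_table;
insert /*+ append */ into big_table
select * from big_table_bak;
commit;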

Since the data size is huge:

  1. I want to know the most optimized SQL for retrieving the distinct records.
  2. Is there any estimate of how long the INSERT INTO ... SELECT will take to complete, and how long it will take to truncate the existing table?

Please let me know if there is any better way to achieve this. My ultimate goal is to remove the duplicates.

From my experience, the insert will be a lot quicker than the delete, so your approach sounds good. You can make it even quicker if you drop the original table, rename the backup to the old name, and re-create all constraints. Then you save the work of re-inserting the rows. – a_horse_with_no_name

How many columns do you need to compare for distinctness? – David Aldridge

15 columns are needed to compare for distinctness. – Avin_add

How much memory does the machine have available, or what is the maximum available size for the PGA? – David Aldridge

Instead of an insert of the distinct rows, go for a CTAS (which will be without indexes and therefore faster too). – SriniV

2 Answers

2 votes

One option for making this memory-efficient is to insert (nologging, append) all of the rows into a table that is hash partitioned on the list of columns on which duplicates are to be detected, or, if there is a limitation on the number of columns, then on as many as you can use (aiming for those with maximum selectivity). Use something like 1,024 partitions, and each one will ideally be around 386 MB (386 GB / 1,024 partitions).
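
A minimal sketch of that staging step, assuming a hypothetical source table BIG_TABLE and using the sample columns as stand-ins for the real partition key (Oracle limits a partition key to 16 columns):

-- Hash-partitioned staging table; the partition key should be the
-- columns that define a duplicate
create table temp_table
nologging
partition by hash (col2, col3)
partitions 1024
as
select * from big_table;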

You have then isolated all of the potential duplicates for each row into the same partition, and standard methods for deduplication will run on each partition without as much memory consumption.

So for each partition you can do something like ...

insert /*+ append */ into new_table
select *
from   temp_table partition (p1) t1
where  not exists (
         select null
         from   temp_table partition (p1) t2
         where  t1.col1 = t2.col1 and
                t1.col2 = t2.col2 and
                t1.col3 = t2.col3 and
                ... etc ...
                t1.rowid < t2.rowid);  -- keep the duplicate with the highest rowid
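
Since that statement has to be repeated for each of the partitions, a hypothetical PL/SQL driver loop over the system-generated partition names (a sketch, assuming the staging table is called TEMP_TABLE and only three comparison columns) might look like this:

begin
  for p in (select partition_name
            from   user_tab_partitions
            where  table_name = 'TEMP_TABLE'
            order  by partition_position)
  loop
    execute immediate
         'insert /*+ append */ into new_table '
      || 'select * from temp_table partition (' || p.partition_name || ') t1 '
      || 'where not exists ('
      || '  select null from temp_table partition (' || p.partition_name || ') t2 '
      || '  where t1.col1 = t2.col1 and t1.col2 = t2.col2 and t1.col3 = t2.col3 '
      || '  and t1.rowid < t2.rowid)';
    commit;  -- a direct-path insert must be committed before the table is touched again
  end loop;
end;
/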

The key to good performance here is that the hash table created to perform the anti-join in that query, which is going to be nearly as big as the partition itself, must be able to fit in memory. So if you can manage a 2 GB sort area you need at least 386/2 ≈ 193 table partitions. Round up to the nearest power of two, so make it 256 table partitions in that case.
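
If you prefer to pin the work area size rather than rely on automatic PGA management, a hypothetical session setup (the values are purely illustrative) could be:

alter session set workarea_size_policy = manual;
alter session set hash_area_size = 2000000000;  -- ~2 GB for the hash anti-join
alter session set sort_area_size = 2000000000;  -- in case a sort is chosen instead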

1 vote

Try this:

rename table_name to table_name_dup;

and then:

create table table_name
as
select
  min(col1) as col1   -- an expression needs a column alias in CTAS
, col2
, col3
from table_name_dup
group by
  col2
, col3;
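
Given the 386 GB of data, a variant worth considering (a sketch, not a tested recommendation; the degree of parallelism is a placeholder) is to run the CTAS with NOLOGGING and PARALLEL to cut redo generation and use direct-path parallelism:

create table table_name
nologging
parallel 8
as
select
  min(col1) as col1
, col2
, col3
from table_name_dup
group by
  col2
, col3;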

As far as I know, not much temp tablespace is used, as the whole GROUP BY takes place in the target tablespace where the new table will be created. Once finished, you can just drop the table with the duplicates:

drop table table_name_dup;
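
Note that CTAS copies only the data (plus NOT NULL constraints), so the primary key and any other indexes or grants from the old table have to be re-created on the new one. For example, assuming the question's PK_NUM column and a hypothetical constraint name:

alter table table_name
  add constraint table_name_pk primary key (pk_num);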