2
votes

I have a table with over 30 million records. When doing insert, I need to avoid the Unique constraint violation.

When I use this NOT EXIST approach, the insert takes forever. In fact, it couldn't finish after 24 hours of running. And I can't use the ignore_row_on_dupkey_index hint, because this table has more than 1 PK columns.

Another option is to insert in subsets. But I want to know if there's any other way before I do sub-setting.

insert into tlb1 a
select * from tlb2 b
where not exists (select 'x' from tlb1 c
                  where b.pk = c.pk)
1
I need to avoid the Unique constraint violation => why? shove your data to the table, let the duplicates fail, that's what's supposed to happen - thatjeffsmith
Ok, what would you do once it fails....? You would try not to insert the same record twice right? Isn't it what I did there using the NOT EXISTS? - Dozel
you need to show us how you plan on doing the 30m row insert, from an existing table, a flat file, something else? - thatjeffsmith
The query in my question is my plan. It's an insert to a Table from another table. - Dozel
Would a merge that only processed inserts be more efficient? - pmdba

1 Answers

1
votes

The important decision depends on the numbe rof row inserted, i.e. the number of the rows in the table TBL2

If this number is rather low (say in hundreds to thousands) you may use safely your approach, provided there is an index on the PK column(s) - whoch should be to enforce the unique constraint.

Please check that the used execution plan is something like the one below

-------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |          |   110 |  2860 |   113   (0)| 00:00:02 |
|   1 |  LOAD TABLE CONVENTIONAL | TBL1     |       |       |            |          |
|   2 |   NESTED LOOPS ANTI      |          |   110 |  2860 |   113   (0)| 00:00:02 |
|   3 |    TABLE ACCESS FULL     | TBL2     |   110 |  1430 |     3   (0)| 00:00:01 |
|*  4 |    INDEX UNIQUE SCAN     | TBL1_IXD |     1 |    13 |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("B"."PK"="C"."PK")

The NESTED LOOPS ANTI means that for each inserted row a single index lookup will be done to check if the key already exists in the target table.

This will work fine for a low number of inserted rows. For a large insert (millions rows) the optimizer will switch to a HASH JOIN RIGHT ANTI, i.e. all rows from both table will be joined to get th epossible duplicates.

This can take some time (but usually not 24 hours) and the approach with DML Error Logging which eliminates the need of the join.

INSERT INTO tbl1 (pk)
SELECT pk
FROM   tbl3
LOG ERRORS INTO err$_tbl1 ('dedup tbl3') REJECT LIMIT UNLIMITED;

This approach will scale well especially when the number of the duplicates is low compared with the number of inserted rows. It is comparable to a normal insert:

---------------------------------------------------------------------------------
| Id  | Operation                | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |      |   876K|    10M|   427   (1)| 00:00:06 |
|   1 |  LOAD TABLE CONVENTIONAL | TBL1 |       |       |            |          |
|   2 |   TABLE ACCESS FULL      | TBL3 |   876K|    10M|   427   (1)| 00:00:06 |
---------------------------------------------------------------------------------