1
votes

I would like to know the most efficient way to insert records from one table to another based on whether the record has changed. Along with the insertion, an update would also need to be performed.

Some key notes. The most recent record will have an endDate of 2100-12-31 to signify it is open-ended. The strtDate is a copy of theTimestamp. I am working with the Snowflake SQL environment. I am unable to use User-Defined Functions.


Suppose I have a Table1:

ID      primKey1  primKey2  checkVar1   checkVar2   theTimestamp   strtDate    endDate
100     1         2         302.1       423.5       2001-07-13     2001-07-13  2100-12-31
101     3         6         506.4       236.7       2005-10-25     2005-10-25  2100-12-31

And I want to insert Table2:

ID      primKey1  primKey2  checkVar1   checkVar2   theTimestamp
100     1         2         302.1       423.5       2001-10-31
101     3         6         767.9       236.7       2006-12-05  

The variables I want to check on whether a record has changed is checkVar1 and checkVar2. In this scenario, the record for ID=100, did not change in the insertion table (Table2), so I don't want to insert this record. But, ID=101 did change, so I want to insert this record.

Here is how Table1 should now look:

ID      primKey1  primKey2  checkVar1   checkVar2   theTimestamp   strtDate    endDate
100     1         2         302.1       423.5       2001-07-13     2001-07-13  2100-12-31
101     3         6         506.4       236.7       2005-10-25     2005-10-25  *2006-12-05*
101     3         6         767.9       236.7       2006-12-05     2006-12-05  2100-12-31

As you can see, the endDate for the old record has updated with the new record's theTimestamp. Then the new record is inserted as a continuation of the old record by taking on the 2100-12-31 endDate. So there needs to be both an UPDATE and an INSERTION at the same time.


My Method:

WITH newTable2Rows AS (
SELECT DISTINCT ID, primKey1, primKey2
FROM Table2
)

WITH maxTable1Rows AS (
SELECT A.ID, A.primKey1, A.primKey2, A.checkVar1, A.checkVar2, A.theTimestamp, A.strtDate, MAX(A.endDate)
FROM Table1 A
JOIN newTable2Rows B
ON A.ID = B.ID, A.primKey1 = B.primKey1, A.primKey2 = B.primKey
GROUP BY A.ID, A.primKey1, A.primKey2, A.checkVar1, A.checkVar2, A.theTimestamp, A.strtDate
)

INSERT INTO Table1 (
ID, primKey1, primKey2, checkVar1, checkVar2, theTimestamp, strtDate, endDate
)
SELECT
ID, primKey1, primKey2, checkVar1, checkVar2, theTimestamp, theTimestamp AS strtDate, '2100-12-31' AS endDate
FROM Table2
MINUS maxTable1Rows

There is a little bit of pseudo code at the end because I haven't completed it yet. But basically I wanted to subtract the max Table1 rows from Table2 so that the duplicate rows are deleted from Table2. This will leave me with unique updated rows from Table2. After this, I will still need to update the max rows from Table1 with '2100-12-31'.

The issue is that storing full rows into the maxTable1Rows table is very expensive. I am dealing with tables which contain 100gb+ of data. The datasets I work with contain over 28 million records and 200+ columns. So I am looking for a method which can perform the UPDATE and INSERT in the most efficient way. Any help would be greatly appreciated.

1
I will add a new boolean field 'NewRow' in table 2. First update this field for all the rows that are changed. Then update the existing rows in Table1, then Insert the new rows. At the end, set 'NewRow' to false, to avoid doing the same thing over and over in future activation. If you cannot use a field, create a temporary table that will contains all the 'NewRow' IDs. - Nicola Lepetit

1 Answers

0
votes

Isn't this just a simple use for the MERGE statement? Snowflake MERGE

The MERGE gives you full control to compare columns and do either inserts or updates based on your criteria.