I would like to convert this SQL query into ANSI SQL. I am having trouble wrapping my head around the logic of this query.

I use Snowflake Data Warehouse, but it does not understand this query because of the 'delete' statement right before join, so I am trying to break it down. From my understanding the row number column is giving me the order from 1 to N based on timestamp and placing it in C. Then C is joined against itself on the rows other than the first row (based on id) and placed in C1. Then C1 is deleted from the overall data, which leaves only the first row.

I may be understanding the logic incorrectly, but I am not used to seeing the 'delete' statement right before a join. Let me know if I got the logic right, or point me in the right direction.

This query was copied from this Stack Overflow question, which describes the exact situation I am trying to solve, except that mine is on a much larger scale.

with C as
(
  select ID,
         row_number() over(order by DT) as rn
  from YourTable
)
delete C1
from C as C1
  inner join C as C2
    on C1.rn = C2.rn-1 and
       C1.ID = C2.ID
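
One way to check the reading of this query is to run the same CTE and join as a SELECT and look at which rows the DELETE would target. Below is a minimal sketch using Python's built-in sqlite3 module; SQLite is an assumption made only so the example is runnable, and the five sample rows are hypothetical, not taken from the post.

```python
import sqlite3

# Hypothetical sample data for YourTable (ID, DT); not from the original post.
rows = [
    (1, "2020-01-01"),
    (1, "2020-01-02"),
    (1, "2020-01-03"),
    (2, "2020-01-04"),
    (1, "2020-01-05"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (ID INTEGER, DT TEXT)")
conn.executemany("INSERT INTO YourTable VALUES (?, ?)", rows)

# Same CTE and join condition as the query above, but selecting the rows
# that the DELETE would remove instead of deleting them.
# Window functions need SQLite 3.25 or newer.
would_delete = conn.execute("""
    WITH C AS (
        SELECT ID, DT, ROW_NUMBER() OVER (ORDER BY DT) AS rn
        FROM YourTable
    )
    SELECT C1.ID, C1.DT
    FROM C AS C1
    INNER JOIN C AS C2
        ON C1.rn = C2.rn - 1 AND C1.ID = C2.ID
    ORDER BY C1.rn
""").fetchall()

# Rows that would survive the DELETE.
kept = [r for r in rows if r not in would_delete]

print("would delete:", would_delete)
print("kept:", kept)
```

With these rows, the join matches every row whose immediate successor in DT order has the same ID, so the last row of each consecutive run of an ID is kept rather than the first. Note also that row_number() here runs over the whole table, with no partitioning.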

The specific problem I am trying to solve is this. Let's assume I have this table. I need to partition the rows by primary key combinations (primKey 1 & 2) while maintaining timestamp order.

ID      primKey1  primKey2  checkVar1   checkVar2   theTimestamp 
100     1         2         302         423         2001-07-13
101     3         6         506         236         2005-10-25
100     1         2         302         423         2002-08-15
101     3         6         506         236         2008-12-05
101     3         6         300         100         2010-06-10
100     1         2         407         309         2005-09-05
100     1         2         302         423         2012-05-09
100     1         2         302         423         2003-07-24

Once the rows are partitioned and ordered by timestamp within each partition, I need to delete the rows that repeat the previous row's checkVar combination (checkVar 1 & 2) until the next change, leaving only the earliest row of each run. The rows marked with asterisks are the duplicates that need to be removed.

ID      primKey1  primKey2  checkVar1   checkVar2   theTimestamp 
100     1         2         302         423         2001-07-13
*100    1         2         302         423         2002-08-15
*100    1         2         302         423         2003-07-24
100     1         2         407         309         2005-09-05
100     1         2         302         423         2012-05-09

101     3         6         506         236         2005-10-25
*101    3         6         506         236         2008-12-05
101     3         6         300         100         2010-06-10

This is the final result. As you can see for ID=100, even though the 1st and 3rd record are the same, the checkVar combination changed in between, which is fine. I am only removing the duplicates until the values change.

ID      primKey1  primKey2  checkVar1   checkVar2   theTimestamp 
100     1         2         302         423         2001-07-13
100     1         2         407         309         2005-09-05
100     1         2         302         423         2012-05-09

101     3         6         506         236         2005-10-25
101     3         6         300         100         2010-06-10
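
The rule described above (keep a row only if its checkVar pair differs from the previous row in the same primKey partition) can be sketched with the standard LAG window function. This is a hedged sketch, not a tested Snowflake solution: it runs the SQL through Python's sqlite3 module purely so it can be executed, the table name YourTable is assumed, and it assumes checkVar1 and checkVar2 are never NULL (a NULL-safe comparison would be needed otherwise).

```python
import sqlite3

# Sample rows copied from the table in the question.
rows = [
    (100, 1, 2, 302, 423, "2001-07-13"),
    (101, 3, 6, 506, 236, "2005-10-25"),
    (100, 1, 2, 302, 423, "2002-08-15"),
    (101, 3, 6, 506, 236, "2008-12-05"),
    (101, 3, 6, 300, 100, "2010-06-10"),
    (100, 1, 2, 407, 309, "2005-09-05"),
    (100, 1, 2, 302, 423, "2012-05-09"),
    (100, 1, 2, 302, 423, "2003-07-24"),
]

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE YourTable (
        ID INTEGER, primKey1 INTEGER, primKey2 INTEGER,
        checkVar1 INTEGER, checkVar2 INTEGER, theTimestamp TEXT
    )
""")
conn.executemany("INSERT INTO YourTable VALUES (?, ?, ?, ?, ?, ?)", rows)

# LAG fetches the previous row's checkVar values within each
# (primKey1, primKey2) partition, ordered by timestamp. A row is kept when
# it is the first in its partition (no previous row, so LAG yields NULL)
# or when either checkVar differs from the previous row.
# Window functions need SQLite 3.25 or newer.
kept = conn.execute("""
    WITH flagged AS (
        SELECT ID, primKey1, primKey2, checkVar1, checkVar2, theTimestamp,
               LAG(checkVar1) OVER (PARTITION BY primKey1, primKey2
                                    ORDER BY theTimestamp) AS prevVar1,
               LAG(checkVar2) OVER (PARTITION BY primKey1, primKey2
                                    ORDER BY theTimestamp) AS prevVar2
        FROM YourTable
    )
    SELECT ID, primKey1, primKey2, checkVar1, checkVar2, theTimestamp
    FROM flagged
    WHERE prevVar1 IS NULL
       OR prevVar1 <> checkVar1
       OR prevVar2 <> checkVar2
    ORDER BY primKey1, primKey2, theTimestamp
""").fetchall()

for row in kept:
    print(row)
```

On the sample data this returns the five rows listed in the final result. Writing the kept rows into a new table with CREATE TABLE ... AS SELECT is one way to apply the filter without a DELETE that joins against a CTE.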
Please explain the logic that you want to implement. It is not obvious from the query. - Gordon Linoff
Deleting through a CTE is one of those things that works in SQL Server but not in other DBMSs. For standard SQL, you would probably need an extra field that is unique in the table. Obviously, that ID isn't the primary key in that table, since it has dups. - LukStorms
Hey Gordon and LukStorms. I edited the problem to show what I am trying to accomplish. In my situation there is no real concept of primary keys, since the primary keys I use are not unique; they are mostly used for partitioning purposes. The combination of primary keys and timestamp, however, is always unique. - Theyallgoleft
Are you trying to delete from the CTE or from YourTable? If this is a one-time thing, you might want to just use a CTAS and create a new table. If this is an ongoing thing, then you might want to create a materialized view over the base table. - Mike Walton

1 Answer

If you want to keep the earliest row for each id, then you can use:

delete from yourtable yt
    where yt.dt > (select min(yt2.dt)
                   from yourtable yt2
                   where yt2.id = yt.id
                  );

Your query would not do this, if that is your intent.