SQL. Delete duplicate records based on multiple criteria

Question

I have a TABLE called Food with a few thousand records. Each record is a food consisting of a Description and 5 categories. Each succeeding category is more specific than the last. Description, cat1, and cat2 cannot accept NULL values. Cat3, cat4, and cat5 can accept NULL values.

For example:

description | cat1 | cat2    | cat3  | cat4      | cat5
Green apple | Food | produce | NULL  | apple     | NULL
Root beer   | Bev  | Non-Alc | NULL  | Root beer | NULL

The issue I'm having is that there are lots of duplicate descriptions assigned different categories. And I need help writing a code that can delete these duplicates by certain conditions.

Example:

description    | cat1 | cat2     | cat3     | cat4  | cat5
Mango Syrup    | Food | Produce  | NULL     | Mango | NULL
Mango Syrup    | Food | Dry Good | NULL     | NULL  | Syrup
Pepperoni Pizza| Food | Meat     | Pepperoni| NULL  | NULL
Pepperoni Pizza| Food | Bakery   | NULL     | Pizza | NULL

I have hundreds of these, and luckily, they're all set up the same way. Out of the three columns cat3, 4, and 5, only one can be populated. If a record's cat5 IS NOT NULL, cat3 and cat4 will be NULL. If a record's cat4 IS NOT NULL, cat3 and cat5 will be NULL. If a record's cat3 IS NOT NULL, cat4 and cat5 will be NULL.

The correct duplicate is the item whose record has the most specific category populated. All other duplicates should be deleted.

Things I've tried: GROUP BY and using MIN() or MAX() to aggregate different categories. This doesn't give me the result I want.

Anything will help. Appreciate it.

Could you have 2 rows where cat3 has been populated both times, and both cat4 and cat5 have the value NULL? For the duplicate data, what are you wanting as the end resultset? Are you looking to update the table, or simply have a "clean" result set returned instead — Larnu
So the records you want to keep from the example are the second and last, is that correct? — Zohar Peled
Seems to me that the biggest issue here is your table is not well normalized. — Sean Lange

Gordon Linoff Gordon Linoff · Accepted Answer · 2018-10-31T15:57:22

So for each description, you want the row that has the "biggest" category filled.

One method uses row_number():

select t.*
from (select t.*,
             row_number() over (partition by description
                                order by case when cat5 is null then 1
                                              when cat4 is null then 2
                                              when cat3 is null then 3
                                              when cat2 is null then 4
                                              when cat1 is null then 5
                                         end
                                ) as seqnum
      from t
     ) t
where seqnum = 1;

If you actually want to delete the other rows, then use an updatable CTE:

with todelete as (
      select t.*
      from (select t.*,
                   row_number() over (partition by description
                                      order by case when cat5 is null then 1
                                                    when cat4 is null then 2
                                                    when cat3 is null then 3
                                                    when cat2 is null then 4
                                                    when cat1 is null then 5
                                               end
                                      ) as seqnum
            from t
           ) t
      )
delete from todelete
    where seqnum > 1;

SQL. Delete duplicate records based on multiple criteria

1 Answers