How to find duplicate values in mysql table based on ANY y out of x columns where y<=x

Question

how to find duplicates on a table with three columns (col1,col2,col3) if ANY two of the columns have duplicate?

put in pseudo-query it would look like this:

If col1 and col2 are the same, then select

if col2 and col3 are the same, then select

if col1 and col3 are the same, then select

if col1,col2 and col3 are the same, then select

I don't want to use 'IF' because the number of columns are actually greater than 10 which would make the 'IF' query very tedious.

thanks.

This sounds like an XY problem. If there's 10 columns you need to de-duplicate maybe you instead have a many to many (but distinct) relationship? — apokryfos

Arth Arth · Accepted Answer · 2018-10-10T13:19:11

N.B. I took your question to mean, that 2 rows are considered duplicates if two or more of their column values are equal. If you were just looking for duplicate values in columns for the same row, @GordonLinoff's answer is more appropriate

In MySQL booleans are actually represented as 0 or 1

For your three column example, the condition

(a.col1 = b.col1) + (a.col2 = b.col2) + (a.col3 = b.col3) >= 2

should do the trick

For example, assuming you have a unique id column:

   SELECT *
     FROM your_table a 
    WHERE EXISTS (
     SELECT 1
       FROM your_table b
      WHERE (a.col1 = b.col1) + (a.col2 = b.col2) + (a.col3 = b.col3) >= 2 
        AND a.id != b.id /** Don't consider the same row */
          )

UPDATE

I'm not surprised that you get a large time difference between 1k and 130k. I imagine the scale will be linear so 15s * 130/1 = 1950s, which is about 30 minutes for querying on the full table.

Also don't forget that for each row, the query is checking all the other rows for duplicates. This is why just fetching the whole table is faster.

I would hope that you only need to use this query as a one-off to identify the dupes. If not, it would point to some iffy database design and likely the table could be refactored to better suit its purpose. This is the XY Problem that @apokryfos is referring to in his comment on your question.

The above query will not allow the use of any indexes on the columns, due to the complicated condition.

You could potentially reach a result faster by using UNION ALL, assuming you have an individual index on some of the columns and id is the PK of the table.

  SELECT base.*
    FROM your_table base
    JOIN (

    SELECT a.id, 1 col_match
      FROM your_table a
     WHERE EXISTS (
      SELECT 1
        FROM your_table b
       WHERE b.col1 = a.col1
         AND b.id != a.id
           )

     UNION ALL

    SELECT a.id, 1 col_match
      FROM your_table a
     WHERE EXISTS (
       SELECT 1
         FROM your_table b
        WHERE b.col2 = a.col2
          AND b.id != a.id
           )

     UNION ALL

    SELECT a.id, 1 col_match
      FROM your_table a
     WHERE EXISTS (
      SELECT 1
        FROM your_table b
       WHERE b.col3 = a.col3
         AND b.id != a.id
           )

         ) raw
      ON raw.id = base.id

GROUP BY base.id
  HAVING SUM(raw.col_match) >= 2

It may look hefty, but could easily be built dynamically in an application layer for your ten plus columns.

Bear in mind, if you have more duplicates than uniques, it may make sense to invert this logic.

How to find duplicate values in mysql table based on ANY y out of x columns where y<=x

2 Answers