N.B. I took your question to mean, that 2 rows are considered duplicates if two or more of their column values are equal. If you were just looking for duplicate values in columns for the same row, @GordonLinoff's answer is more appropriate
In MySQL booleans are actually represented as 0 or 1
For your three column example, the condition
(a.col1 = b.col1) + (a.col2 = b.col2) + (a.col3 = b.col3) >= 2
should do the trick
For example, assuming you have a unique id
column:
SELECT *
FROM your_table a
WHERE EXISTS (
SELECT 1
FROM your_table b
WHERE (a.col1 = b.col1) + (a.col2 = b.col2) + (a.col3 = b.col3) >= 2
AND a.id != b.id /** Don't consider the same row */
)
UPDATE
I'm not surprised that you get a large time difference between 1k and 130k. I imagine the scale will be linear so 15s * 130/1 = 1950s, which is about 30 minutes for querying on the full table.
Also don't forget that for each row, the query is checking all the other rows for duplicates. This is why just fetching the whole table is faster.
I would hope that you only need to use this query as a one-off to identify the dupes. If not, it would point to some iffy database design and likely the table could be refactored to better suit its purpose. This is the XY Problem that @apokryfos is referring to in his comment on your question.
The above query will not allow the use of any indexes on the columns, due to the complicated condition.
You could potentially reach a result faster by using UNION ALL
, assuming you have an individual index on some of the columns and id
is the PK of the table.
SELECT base.*
FROM your_table base
JOIN (
SELECT a.id, 1 col_match
FROM your_table a
WHERE EXISTS (
SELECT 1
FROM your_table b
WHERE b.col1 = a.col1
AND b.id != a.id
)
UNION ALL
SELECT a.id, 1 col_match
FROM your_table a
WHERE EXISTS (
SELECT 1
FROM your_table b
WHERE b.col2 = a.col2
AND b.id != a.id
)
UNION ALL
SELECT a.id, 1 col_match
FROM your_table a
WHERE EXISTS (
SELECT 1
FROM your_table b
WHERE b.col3 = a.col3
AND b.id != a.id
)
) raw
ON raw.id = base.id
GROUP BY base.id
HAVING SUM(raw.col_match) >= 2
It may look hefty, but could easily be built dynamically in an application layer for your ten plus columns.
Bear in mind, if you have more duplicates than uniques, it may make sense to invert this logic.