I have a table with a varchar column, and I would like to find all the records that have duplicate values in this column. What is the best query I can use to find the duplicates?
26 Answers
SELECT *
FROM mytable mto
WHERE EXISTS
(
SELECT 1
FROM mytable mti
WHERE mti.varchar_column = mto.varchar_column
LIMIT 1, 1
)
This query returns complete records, not just distinct varchar_column
's.
This query doesn't use COUNT(*)
. If there are lots of duplicates, COUNT(*)
is expensive, and you don't need the whole COUNT(*)
, you just need to know if there are two rows with same value.
This is achieved by the LIMIT 1, 1
at the bottom of the correlated query (essentially meaning "return the second row"). EXISTS would only return true if the aforementioned second row exists (i. e. there are at least two rows with the same value of varchar_column) .
Having an index on varchar_column
will, of course, speed up this query greatly.
to get all the data that contains duplication i used this:
SELECT * FROM TableName INNER JOIN(
SELECT DupliactedData FROM TableName GROUP BY DupliactedData HAVING COUNT(DupliactedData) > 1 order by DupliactedData)
temp ON TableName.DupliactedData = temp.DupliactedData;
TableName = the table you are working with.
DupliactedData = the duplicated data you are looking for.
Taking @maxyfc's answer further, I needed to find all of the rows that were returned with the duplicate values, so I could edit them in MySQL Workbench:
SELECT * FROM table
WHERE field IN (
SELECT field FROM table GROUP BY field HAVING count(*) > 1
) ORDER BY field
My final query incorporated a few of the answers here that helped - combining group by, count & GROUP_CONCAT.
SELECT GROUP_CONCAT(id), `magento_simple`, COUNT(*) c
FROM product_variant
GROUP BY `magento_simple` HAVING c > 1;
This provides the id of both examples (comma separated), the barcode I needed, and how many duplicates.
Change table and columns accordingly.
I saw the above result and query will work fine if you need to check single column value which are duplicate. For example email.
But if you need to check with more columns and would like to check the combination of the result so this query will work fine:
SELECT COUNT(CONCAT(name,email)) AS tot,
name,
email
FROM users
GROUP BY CONCAT(name,email)
HAVING tot>1 (This query will SHOW the USER list which ARE greater THAN 1
AND also COUNT)
I prefer to use windowed functions(MySQL 8.0+) to find duplicates because I could see entire row:
WITH cte AS (
SELECT *
,COUNT(*) OVER(PARTITION BY col_name) AS num_of_duplicates_group
,ROW_NUMBER() OVER(PARTITION BY col_name ORDER BY col_name2) AS pos_in_group
FROM table
)
SELECT *
FROM cte
WHERE num_of_duplicates_group > 1;
The following will find all product_id that are used more than once. You only get a single record for each product_id.
SELECT product_id FROM oc_product_reward GROUP BY product_id HAVING count( product_id ) >1
Code taken from : http://chandreshrana.blogspot.in/2014/12/find-duplicate-records-based-on-any.html
CREATE TABLE tbl_master
(`id` int, `email` varchar(15));
INSERT INTO tbl_master
(`id`, `email`) VALUES
(1, '[email protected]'),
(2, '[email protected]'),
(3, '[email protected]'),
(4, '[email protected]'),
(5, '[email protected]');
QUERY : SELECT id, email FROM tbl_master
WHERE email IN (SELECT email FROM tbl_master GROUP BY email HAVING COUNT(id) > 1)
For removing duplicate rows with multiple fields , first cancate them to the new unique key which is specified for the only distinct rows, then use "group by" command to removing duplicate rows with the same new unique key:
Create TEMPORARY table tmp select concat(f1,f2) as cfs,t1.* from mytable as t1;
Create index x_tmp_cfs on tmp(cfs);
Create table unduptable select f1,f2,... from tmp group by cfs;
One very late contribution... in case it helps anyone waaaaaay down the line... I had a task to find matching pairs of transactions (actually both sides of account-to-account transfers) in a banking app, to identify which ones were the 'from' and 'to' for each inter-account-transfer transaction, so we ended up with this:
SELECT
LEAST(primaryid, secondaryid) AS transactionid1,
GREATEST(primaryid, secondaryid) AS transactionid2
FROM (
SELECT table1.transactionid AS primaryid,
table2.transactionid AS secondaryid
FROM financial_transactions table1
INNER JOIN financial_transactions table2
ON table1.accountid = table2.accountid
AND table1.transactionid <> table2.transactionid
AND table1.transactiondate = table2.transactiondate
AND table1.sourceref = table2.destinationref
AND table1.amount = (0 - table2.amount)
) AS DuplicateResultsTable
GROUP BY transactionid1
ORDER BY transactionid1;
The result is that the DuplicateResultsTable
provides rows containing matching (i.e. duplicate) transactions, but it also provides the same transaction id's in reverse the second time it matches the same pair, so the outer SELECT
is there to group by the first transaction ID, which is done by using LEAST
and GREATEST
to make sure the two transactionid's are always in the same order in the results, which makes it safe to GROUP
by the first one, thus eliminating all the duplicate matches. Ran through nearly a million records and identified 12,000+ matches in just under 2 seconds. Of course the transactionid is the primary index, which really helped.