1
votes

I have a table in my database to store user data. I found a defect in the code that adds data to this table database where if a network timeout occurs, the code updated the next user's data with the previous user's data. I've addressed this defect but I need to clean the database. I've added a flag to indicate the rows that need to be ignored and my goal is to mark these flags accordingly for duplicates. In some cases, though, duplicate values may actually be legitimate so I am more interested in finding several user's with the same data (i.e, u> 2).

Here's an example (tablename = Data):

id---- user_id----data1----data2----data3----datetime-----------flag

1-----usr1--------3---------- 2---------2---------2012-02-16..-----0

2-----usr2--------3---------- 2---------2---------2012-02-16..-----0

3-----usr3--------3---------- 2---------2---------2012-02-16..-----0

In this case, I'd like to mark the 1 and 2 id flags as 1 (to indicate ignore). Since we know usr1 was the original datapoint (assuming the oldest dates are earlier in the list).

At this point there are so many entries in the table that I'm not sure the best way to identify the users that have duplicate entries.

I'm looking for a mysql command to identify the problem data first and then I'll be able to mark the entries. Could someone guide me in the right direction?

2

2 Answers

0
votes

Well, first select duplicate data with their min user id:

CREATE TEMPORARY TABLE duplicates
  SELECT MIN(user_id), data1,data2,data3
  FROM data
  GROUP BY data1,data2,data3
  HAVING COUNT(*) > 1                       -- at least two rows
     AND COUNT(*) = COUNT(DISTINCT user_id) -- all user_ids must be different
     AND TIMESTAMPDIFF( MINUTE, MIN(`datetime`), MAX(`datetime`)) <= 45;

(I'm not sure, if I used TIMESTAMPDIFF properly.) Now we can update the flag in those rows where user_id is different:

UPDATE     duplicate
INNER JOIN data      ON data.data1 = duplicate.data1
                    AND data.data2 = duplicate.data2
                    AND data.data3 = duplicate.data3
                    AND data.user_id != duplicate.user_id
SET data.flag = 1;
0
votes
UPDATE Data A
LEFT JOIN
(
    SELECT user_id,data1,data2,data3,min(id) min_id
    FROM Data GROUP BY user_id,data1,data2,data3
) B
ON A.id = B.min_id
SET A.flag = IF(ISNULL(B.min_id),1,0);

If there are duplicate times involved, maybe try this

UPDATE Data A
LEFT JOIN
(
    SELECT user_id,data1,data2,data3,,`datetime`,min(id) min_id
    FROM Data GROUP BY user_id,data1,data2,data3,`datetime`
) B
ON A.id = B.min_id
SET A.flag = IF(ISNULL(B.min_id),1,0);