114
votes

What is the best way to find records with duplicate values across multiple columns using Postgres, and Activerecord?

I found this solution here:

User.find(:all, :group => [:first, :email], :having => "count(*) > 1" )

But it doesn't seem to work with postgres. I'm getting this error:

PG::GroupingError: ERROR: column "parts.id" must appear in the GROUP BY clause or be used in an aggregate function

5
In regular SQL, I'd use a self-join, something like select a.id, b.id, name, email FROM user a INNER JOIN user b USING (name, email) WHERE a.id > b.id. No idea how to express that in ActiveRecord-speak.Craig Ringer

5 Answers

249
votes

Tested & Working Version

User.select(:first,:email).group(:first,:email).having("count(*) > 1")

Also, this is a little unrelated but handy. If you want to see how times each combination was found, put .size at the end:

User.select(:first,:email).group(:first,:email).having("count(*) > 1").size

and you'll get a result set back that looks like this:

{[nil, nil]=>512,
 ["Joe", "[email protected]"]=>23,
 ["Jim", "[email protected]"]=>36,
 ["John", "[email protected]"]=>21}

Thought that was pretty cool and hadn't seen it before.

Credit to Taryn, this is just a tweaked version of her answer.

39
votes

That error occurs because POSTGRES requires you to put grouping columns in the SELECT clause.

try:

User.select(:first,:email).group(:first,:email).having("count(*) > 1").all

(note: not tested, you may need to tweak it)

EDITED to remove id column

14
votes

If you need the full models, try the following (based on @newUserNameHere's answer).

User.where(email: User.select(:email).group(:email).having("count(*) > 1").select(:email))

This will return the rows where the email address of the row is not unique.

I'm not aware of a way to do this over multiple attributes.

5
votes

Get all duplicates with a single query if you use PostgreSQL:

def duplicated_users
  duplicated_ids = User
    .group(:first, :email)
    .having("COUNT(*) > 1")
    .select('unnest((array_agg("id"))[2:])')

  User.where(id: duplicated_ids)
end

irb> duplicated_users
-1
votes

Based on the answer above by @newUserNameHere I believe the right way to show the count for each is

res = User.select('first, email, count(1)').group(:first,:email).having('count(1) > 1')

res.each {|r| puts r.attributes } ; nil