2
votes

I was trying to find duplicates on column rows, but as they are fuzzy (not the same value, misspelling, indent space) I have to use pg_trgm extensions and similarity() function to find those. The problem is: this query is quite long and inefficent, even if I place all possible indexes.

My setup: PostgreSQL 11 pg_trgm enabled tablename:

id        col_name    fk_id
1         thing       2
2          thing      3
3         thing1      1
4         th1ng       4

There is almost 10k rows in this table, just for getting understanding what I'm dealing with.

I created this index:

CREATE INDEX CONCURRENTLY index_nameof_streets_trgm
ON tablename
USING gin (col_name gin_trgm_ops);

And ran this query (I didn't found any other way to compare column rows to themselves except self-join)

SELECT f1.col_name, f2.col_name, similarity(f1.col_name, f2.col_name)
FROM tablename f1 
INNER JOIN 
tablename f2 ON f1."Id" <> f2."Id"
WHERE similarity > 0.7

Damn, it took more than 1200secs and still not finished! (Actually, it's not very unexpected as I got this explain on query):

Nested Loop  (cost=0.00..1748422.51 rows=99870042 width=4)
  Join Filter: (f1."Id" <> f2."Id")
  ->  Seq Scan on "Streets" f1  (cost=0.00..260.94 rows=9994 width=37)
  ->  Materialize  (cost=0.00..310.91 rows=9994 width=37)
    ->  Seq Scan on "Streets" f2  (cost=0.00..260.94 rows=9994 width=37)

I feel like I'm missing something simple and almost dumb, but I can't find what exactly. Any hint on how to find fuzzy duplicates in single column would be appreciated! Thanks :)

Have a look at the fuzzystrmatch extension. Maybe group by metaphone (and experiment with the metaphone length). - teppic
Unfortunately, it's not a solution in my case as it wouldn't work fine with UTF-8, and also my data is not English-speaking only, so that's a problem. Thanks for a suggestion anyway! - Pavel Nasevich
What are the target languages? dmetaphone is supposed to handle non-english words better. As for character encoding, you should be able to clean that up (stripping accents an the like). - teppic
In any case, you will need a fingerprinting function similar to metaphone if you want to find duplicates. Your present approach does not provide discrete groupings, but rather joins each row with all similar rows, providing a (very large) number of overlapping sets. - teppic