I'm trying to find duplicate rows in a column, but since they are fuzzy (not identical values: misspellings, stray indent spaces) I have to use the pg_trgm extension and the similarity() function to find them. The problem is that this query runs for a very long time and is inefficient, even after creating every index I could think of.
My setup:

PostgreSQL 11
pg_trgm enabled

tablename:

id  col_name  fk_id
1   thing     2
2   thing     3
3   thing1    1
4   th1ng     4
There are almost 10k rows in this table, just to give an idea of what I'm dealing with.
I created this index:
CREATE INDEX CONCURRENTLY index_nameof_streets_trgm
ON tablename
USING gin (col_name gin_trgm_ops);
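For reference: a GIN trigram index like this one is only considered for trigram-aware operators such as % (similarity match) or LIKE/ILIKE patterns, not for a plain similarity() call in a WHERE clause. A minimal sketch, assuming the table and index above, of a query shape the index can actually accelerate:

```sql
-- The % operator matches when similarity exceeds the
-- pg_trgm.similarity_threshold setting (default 0.3), and unlike a
-- similarity(...) > x predicate, it can use the gin_trgm_ops index.
SET pg_trgm.similarity_threshold = 0.7;

SELECT col_name
FROM tablename
WHERE col_name % 'thing';  -- candidate for a bitmap index scan, not a seq scan
```

You can check with EXPLAIN whether the planner picks the index for such a query.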
And ran this query (I couldn't find any other way to compare a column's rows against themselves except a self-join):
SELECT f1.col_name, f2.col_name, similarity(f1.col_name, f2.col_name)
FROM tablename f1
INNER JOIN tablename f2 ON f1."Id" <> f2."Id"
WHERE similarity(f1.col_name, f2.col_name) > 0.7
Damn, it took more than 1200 seconds and still hadn't finished! (Actually, that's not very surprising given the EXPLAIN output for the query):
Nested Loop  (cost=0.00..1748422.51 rows=99870042 width=4)
  Join Filter: (f1."Id" <> f2."Id")
  ->  Seq Scan on "Streets" f1  (cost=0.00..260.94 rows=9994 width=37)
  ->  Materialize  (cost=0.00..310.91 rows=9994 width=37)
        ->  Seq Scan on "Streets" f2  (cost=0.00..260.94 rows=9994 width=37)
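The plan shows the trigram index is never touched: the join filter only excludes identical ids, so all ~10k rows are compared against all others (roughly 100 million similarity() calls). One possible rewrite, sketched under the assumption that the id column is what distinguishes rows, uses the indexable % operator in the join condition and emits each pair only once:

```sql
-- Assumption: the % operator's cutoff is governed by this setting.
SET pg_trgm.similarity_threshold = 0.7;

SELECT f1.col_name, f2.col_name,
       similarity(f1.col_name, f2.col_name) AS sml
FROM tablename f1
JOIN tablename f2
  ON f1.id < f2.id              -- each pair once; also excludes self-pairs
 AND f1.col_name % f2.col_name  -- trigram match the GIN index can serve
ORDER BY sml DESC;
```

With this shape the planner can run a nested loop that probes the gin_trgm_ops index for each outer row instead of materializing the full cross product.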
I feel like I'm missing something simple, almost dumb, but I can't figure out what exactly. Any hint on how to find fuzzy duplicates in a single column would be appreciated! Thanks :)
Comments:

metaphone (and experiment with the metaphone length). - teppic

dmetaphone is supposed to handle non-English words better. As for character encoding, you should be able to clean that up (stripping accents and the like). - teppic

metaphone if you want to find duplicates. Your present approach does not provide discrete groupings, but rather joins each row with all similar rows, producing a (very large) number of overlapping sets. - teppic
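Following the metaphone suggestion in the comments: a hedged sketch, assuming the fuzzystrmatch extension is available, that produces discrete groups by phonetic code instead of overlapping pairwise joins. The code length of 8 is an arbitrary choice to experiment with.

```sql
-- fuzzystrmatch provides metaphone(text, max_length) and dmetaphone(text).
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Rows sharing a phonetic code form one candidate duplicate group.
SELECT metaphone(col_name, 8) AS code,
       array_agg(id)          AS ids,
       array_agg(col_name)    AS names
FROM tablename
GROUP BY metaphone(col_name, 8)
HAVING count(*) > 1;
```

Unlike the self-join, this is a single aggregation pass, and each row lands in exactly one group.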