List of items find almost duplicates

Question

Within excel I have a list of artists, songs, edition. This list contains over 15000 records. The problem is the list does contain some "duplicate" records. I say "duplicate" as they aren't a complete match. Some might have a few typo's and I'd like to fix this up and remove those records.

So for example some records:

ABBA - Mamma Mia - Party
ABBA - Mama Mia! - Official

Each dash indicates a separate column (so 3 columns A, B, C are filled in)

How would I mark them as duplicates within Excel?

I've found out about the tool Fuzzy Lookup. Yet I'm working on a mac and since it's not available on mac I'm stuck.

Any regex magic or vba script what can help me out? It'd also be alright to see how much similar the row is (say 80% similar).

This is the only fuzzy matching code I've found to be useful: mrexcel.com/forum/excel-questions/… — Darren Bartrup-Cook

Hambone Hambone · Accepted Answer · 2016-08-19T15:34:13

One of the common methods for fuzzy text matching is the Levenshtein (distance) algorithm. Several nice implementations of this exist here:

https://stackoverflow.com/a/4243652/1278553

From there, you can use the function directly in your spreadsheet to find similarities between instances:

You didn't ask, but a database would be really nice here. The reason is you can do a cartesian join (one of the very few valid uses for this) and compare every single record against every other record. For example:

select
  s1.group, s2.group, s1.song, s2.song,
  levenshtein (s1.group, s2.group) as group_match,
  levenshtein (s1.song, s2.song) as song_match
from
  songs s1
  cross join songs s2
order by
  group_match, song_match

Yes, this would be a very costly query, depending on the number of records (in your example 225,000,000 rows), but it would bubble to the top the most likely duplicates / matches. Not only that, but you can incorporate "reasonable" joins to eliminate obvious mismatches, for example limit it to cases where the group matches, nearly matches, begins with the same letter, etc, or pre-filtering out groups where the Levenschtein is greater than x.

List of items find almost duplicates

3 Answers