1
votes

Within excel I have a list of artists, songs, edition. This list contains over 15000 records. The problem is the list does contain some "duplicate" records. I say "duplicate" as they aren't a complete match. Some might have a few typo's and I'd like to fix this up and remove those records.

So for example some records:

ABBA - Mamma Mia - Party
ABBA - Mama Mia! - Official

Each dash indicates a separate column (so 3 columns A, B, C are filled in)

How would I mark them as duplicates within Excel?

I've found out about the tool Fuzzy Lookup. Yet I'm working on a mac and since it's not available on mac I'm stuck.

Any regex magic or vba script what can help me out? It'd also be alright to see how much similar the row is (say 80% similar).

3
This is the only fuzzy matching code I've found to be useful: mrexcel.com/forum/excel-questions/… - Darren Bartrup-Cook

3 Answers

1
votes

One of the common methods for fuzzy text matching is the Levenshtein (distance) algorithm. Several nice implementations of this exist here:

https://stackoverflow.com/a/4243652/1278553

From there, you can use the function directly in your spreadsheet to find similarities between instances:

enter image description here

You didn't ask, but a database would be really nice here. The reason is you can do a cartesian join (one of the very few valid uses for this) and compare every single record against every other record. For example:

select
  s1.group, s2.group, s1.song, s2.song,
  levenshtein (s1.group, s2.group) as group_match,
  levenshtein (s1.song, s2.song) as song_match
from
  songs s1
  cross join songs s2
order by
  group_match, song_match

Yes, this would be a very costly query, depending on the number of records (in your example 225,000,000 rows), but it would bubble to the top the most likely duplicates / matches. Not only that, but you can incorporate "reasonable" joins to eliminate obvious mismatches, for example limit it to cases where the group matches, nearly matches, begins with the same letter, etc, or pre-filtering out groups where the Levenschtein is greater than x.

0
votes

You could use an array formula, to indicate the duplicates, and you could modify the below to show the row numbers, this checks the rows beneath the entry for any possible 80% dupes, where 80% is taken as left to right, not total comparison. My data is a1:a15000

=IF(NOT(ISERROR(FIND(MID($A1,1,INT(LEN($A1)*0.8)),$A2:$A$15000))),1,0)

This way will also look back up the list, to indicate the ones found

=SUM(IF(ISERROR(FIND(MID($A2,1,INT(LEN($A1)*0.8)),$A3:$A$15000,1)),0,1))+SUM(IF(ISERROR(FIND(MID($A2,1,INT(LEN($A2)*0.8)),$A$1:$A1,1)),0,1))

The first entry i.e. row 1 is the first part of the formula, and the last row will need the last part after the +

-1
votes

try this worksheet fucntions in your loop:

=COUNTIF(Range,"*yourtexttofind*")