1
votes

Does anyone have experience using the Record Linkage Toolkit with extremely large datasets? I have a few questions. Utlimately, I need to deploy it to an EC2 instance, but for now, I'm trying to figure out how to take advantage of parallel processing - I'll want to do the same on EC2.

When specifying the number of cores (njobs), the code actually runs significantly SLOWER than if I don't specify multiple cores.

compare_dupes = rl.Compare(n_jobs=12)

Related to this - I am working on a record set with 12 million customer records that need to be deduped. Currently I'm blocking on first name, last name, and zip code.

However, the number of potential record pairs index is still so large that it causes memory failures. I have tried Dask - no luck. I'm not sure what else to try. Anyone have suggestions? My code looks like this:

# this section creates a huge multi-index, which causes memory failures
dupe_indexer = rl.Index()
dupe_indexer.block(['first_name_clean','last_name_clean','zip_clean'])
dupe_candidate_links = dupe_indexer.index(df_c)


# I can put n_jobs=12 (the number of cores) in the Compare function below, 
# but for some reason it actually performs worse

compare_dupes = rl.Compare()
compare_dupes.string('first_name_clean','first_name_clean', method='jarowinkler', threshold=0.85, label='first_name_cl')
compare_dupes.string('last_name_clean','last_name_clean', method='jarowinkler', threshold=0.85, label='last_name_cl')
compare_dupes.string('email', 'email', method='jarowinkler', threshold=0.90, label='email_cl')
compare_dupes.string('address_clean','address_clean', method='damerau_levenshtein', threshold=0.6, label='address_cl')

compare_dupes.string('zip_clean','zip_clean', method='jarowinkler',threshold=0.90, label='zip_cl'
dupe_features = compare_dupes.compute(dupe_candidate_links, df_c).reset_index()

I have also tried the "index_split" method:

 s = rl.index_split(dupe_candidate_links, 100)
 for chunk in s:  

which works for reasonably size data sets < 2 million, but when the size of the dataset gets beyond that 5, 8, 10, 15, 20 million - even the index can't fit into memory.

Thanks for any support!

1
do you solve your problem, i am having the same issue. - NĂ¡thali

1 Answers

0
votes
indexer = recordlinkage.Index()
#Create indexing object
indexer = rl.SortedNeighbourhoodIndex(on='X')
# Create pandas MultiIndex containing candidate links
candidate_links = indexer.index(A, B)


%%time
comp = recordlinkage.Compare()
comp.string('X', 'X', method='jarowinkler', threshold=0.60)
mymatchestwonew = comp.compute(candidate_links, A, B)