Does anyone have experience using the Record Linkage Toolkit with extremely large datasets? I have a few questions. Utlimately, I need to deploy it to an EC2 instance, but for now, I'm trying to figure out how to take advantage of parallel processing - I'll want to do the same on EC2.
When specifying the number of cores (njobs), the code actually runs significantly SLOWER than if I don't specify multiple cores.
compare_dupes = rl.Compare(n_jobs=12)
Related to this - I am working on a record set with 12 million customer records that need to be deduped. Currently I'm blocking on first name, last name, and zip code.
However, the number of potential record pairs index is still so large that it causes memory failures. I have tried Dask - no luck. I'm not sure what else to try. Anyone have suggestions? My code looks like this:
# this section creates a huge multi-index, which causes memory failures
dupe_indexer = rl.Index()
dupe_indexer.block(['first_name_clean','last_name_clean','zip_clean'])
dupe_candidate_links = dupe_indexer.index(df_c)
# I can put n_jobs=12 (the number of cores) in the Compare function below,
# but for some reason it actually performs worse
compare_dupes = rl.Compare()
compare_dupes.string('first_name_clean','first_name_clean', method='jarowinkler', threshold=0.85, label='first_name_cl')
compare_dupes.string('last_name_clean','last_name_clean', method='jarowinkler', threshold=0.85, label='last_name_cl')
compare_dupes.string('email', 'email', method='jarowinkler', threshold=0.90, label='email_cl')
compare_dupes.string('address_clean','address_clean', method='damerau_levenshtein', threshold=0.6, label='address_cl')
compare_dupes.string('zip_clean','zip_clean', method='jarowinkler',threshold=0.90, label='zip_cl'
dupe_features = compare_dupes.compute(dupe_candidate_links, df_c).reset_index()
I have also tried the "index_split" method:
s = rl.index_split(dupe_candidate_links, 100)
for chunk in s:
which works for reasonably size data sets < 2 million, but when the size of the dataset gets beyond that 5, 8, 10, 15, 20 million - even the index can't fit into memory.
Thanks for any support!