How do I perform deduplication with the python record linkage toolkit with large data sets?

Question

I am currently using the Python Record Linkage Toolkit to perform deduplication on data sets at work. In an ideal world, I would just use blocking or sorted neighbourhood indexing to trim down the size of the index of record pairs, but sometimes I need to do a full index on a data set with over 75k records, which results in a couple billion record pairs.

The issue I'm running into is that the workstation I'm able to use runs out of memory, so it can't store the full 2.5-3 billion pair MultiIndex. I know the documentation has ideas for doing record linkage between two large data sets using numpy split, which is simple enough for my usage, but it doesn't provide anything for deduplication within a single dataframe. I actually incorporated this subset suggestion into a method for splitting the MultiIndex into subsets and running those, but that doesn't get around the .index() call seemingly loading the entire MultiIndex into memory and causing an out-of-memory error.
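For scale, here is a rough back-of-the-envelope sketch of the pair count for a full index over 75,000 records. The 8 bytes per label is an assumption for illustration only; the actual footprint of a pandas MultiIndex depends on how it stores its levels and codes.

```python
from math import comb

n_records = 75_000

# A full dedup index compares every record with every other record once:
# n * (n - 1) / 2 unordered pairs.
n_pairs = comb(n_records, 2)

# Assumption (not from the toolkit): each pair holds two labels of 8 bytes each.
bytes_per_pair = 2 * 8
approx_gb = n_pairs * bytes_per_pair / 1e9

print(f"{n_pairs:,} pairs, roughly {approx_gb:.0f} GB")
# prints "2,812,462,500 pairs, roughly 45 GB"
```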

Is there a way to split a dataframe and compute the matched pairs iteratively so I don't have to load the whole kit and caboodle into memory at once? I was looking at Dask, but I'm still pretty green on the whole Python thing, so I don't know how to incorporate Dask dataframes into the Record Linkage Toolkit.

misterducky · Accepted Answer · 2021-10-11T20:57:10

While I was able to solve this, sort of, I am going to leave it open because I suspect that, given my inexperience with Python, my process could be improved.

Basically, I had to ditch the index function from the Record Linkage Toolkit. I pulled out the index of the dataframe I was using, converted it to a list, and passed it through the itertools combinations function.

from itertools import combinations, islice

# fl is the dataframe being deduplicated
candidates = fl.index.tolist()
# lazy iterator over every unordered pair of index labels
candidates = combinations(candidates, 2)

This gave me an iterator of tuples, without having to load everything into memory. I then consumed it in chunks of one million pairs using an islice grouper in a for loop.

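# each x is a list of up to 1,000,000 pairs; the loop ends on an empty list
for x in iter(lambda: list(islice(candidates, 1000000)), []):
    ...  # run the comparisons on this chunk of pairs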

I then performed all of the necessary comparisons inside the for loop and added each resulting dataframe to a dictionary, which I concatenated at the end to get the full list. Python's memory usage didn't rise above 3GB the entire time.
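Put together, the chunking part of this approach can be sketched as a small generator. This is a minimal illustration rather than the exact code I ran: chunked_pairs, labels, and the tiny chunk size are hypothetical names and values chosen for the example.

```python
from itertools import combinations, islice


def chunked_pairs(labels, chunk_size):
    """Lazily yield lists of at most chunk_size unordered label pairs."""
    pairs = combinations(labels, 2)  # nothing is materialised yet
    while True:
        chunk = list(islice(pairs, chunk_size))
        if not chunk:  # iterator exhausted
            return
        yield chunk


# Toy stand-in for a dataframe index (hypothetical data).
labels = ["a", "b", "c", "d", "e"]

# Only one chunk of pairs is held in memory at a time inside the generator;
# list() is used here just to inspect the result on toy data.
chunks = list(chunked_pairs(labels, 4))
print([len(chunk) for chunk in chunks])
```

In a real run, each chunk could be turned into a pair index with pandas.MultiIndex.from_tuples(chunk) and handed to the compute method of a recordlinkage Compare object along with the dataframe, with the per-chunk results collected and passed to pandas.concat once at the end. Filtering each result down to likely matches before storing it should help keep the collected results small.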

I would still love some information on how to incorporate Dask into this, so I will accept any answer that can provide that (unless the mods think I should open a new question).
