I am currently using Python Record Linkage Toolkit to perform deduplication on data sets at work. In an ideal world, I would just use blocking or sortedneighborhood to trim down the size of the index of record pairs, but sometimes I need to do a full index on a data set with over 75k records, which results in a couple billion records pairs.
The issue I'm running into is that the workstation I'm able to use is running out of memory, so it can't store the full 2.5-3 billion pair multi-index. I know the documentation has ideas for doing record linkage with two large data sets using numpy split, which is simple enough for my usage, but doesn't provide anything for deduplication within a single dataframe. I actually incorporated this subset suggestion into a method for splitting the multiindex into subsets and running those, but it doesn't get around the issue of the .index() call seemingly loading the entire multiindex into memory and causing an out of memory error.
Is there a way to split a dataframe and compute the matched pairs iteratively so I don't have to load the whole kit and kaboodle into memory at once? I was looking at dask, but I'm still pretty green on the whole python thing, so I don't know how to incorporate the dask dataframes into the record linkage toolkit.