I am using RapidFuzz to find similar strings within a column of my dataframe. I use process.cdist, which returns a matrix (an ndarray) holding the distance/similarity between every pair of strings. I chose this method because it is the fastest I have tried so far, and my dataframe has around 400,000 rows.

Is there a way to get tuples (or something similar) that include the string together with its similarity score, instead of just the score? Here is my code, along with the documentation for process.cdist for anyone who is curious: https://maxbachmann.github.io/RapidFuzz/Usage/process.html

from rapidfuzz import process, fuzz

strings1 = df100k['usernames']
A = process.cdist(strings1, strings1, scorer=fuzz.ratio)

Here is what it outputs:

[[100.        25.        26.666666 ...  28.571428  40.        12.5     ]
 [ 25.       100.        11.764706 ...  25.        11.764706  33.333332]
 [ 26.666666  11.764706 100.       ...  26.666666  25.        11.764706]
 ...
 [ 28.571428  25.        26.666666 ... 100.        40.        12.5     ]
 [ 40.        11.764706  25.       ...  40.       100.        11.764706]
 [ 12.5       33.333332  11.764706 ...  12.5       11.764706 100.      ]]

Wanted output:

[[('abc', 100.0), ('def', 25.0), ...], ...]
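
To make the wanted shape concrete, here is a minimal sketch of the pairing I have in mind, in plain Python. The helper name pair_scores and its min_score parameter are hypothetical (not part of RapidFuzz), and the small score matrix is made up for illustration; it is not real fuzz.ratio output. It assumes the columns of the matrix are in the same order as the list of strings, which should be the case when the same sequence is passed as both arguments to process.cdist.

```python
def pair_scores(choices, matrix, min_score=0.0):
    """Pair every score in `matrix` with the string belonging to its column.

    `matrix` can be any 2D iterable of numbers (nested lists here; an
    ndarray should iterate row by row in the same way).
    Scores below `min_score` are dropped from the result.
    """
    choices = list(choices)
    return [
        [(choices[j], float(score))
         for j, score in enumerate(row)
         if score >= min_score]
        for row in matrix
    ]


# Placeholder values for illustration only, not real fuzz.ratio output.
names = ["abc", "def", "abd"]
scores = [
    [100.0, 0.0, 66.7],
    [0.0, 100.0, 33.3],
    [66.7, 33.3, 100.0],
]

paired = pair_scores(names, scores)
print(paired[0])
```

With around 400,000 strings the full matrix has about 1.6e11 entries, so building a tuple for every cell is probably not practical; keeping only the pairs above some min_score would be acceptable for my use case.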

Thank you in advance!