I am using RapidFuzz to find similar strings within a column of my dataframe. I call process.cdist, which returns an ndarray holding the similarity score (or distance, depending on the scorer) between every pair of strings. I chose this method because it is the fastest I have tried so far, and my dataframe has around 400,000 rows.
Is there a way to get output that pairs each string with its similarity score, such as a tuple, instead of just the score? Here is my code, along with the documentation for process.cdist for anyone who is curious: https://maxbachmann.github.io/RapidFuzz/Usage/process.html
from rapidfuzz import process, fuzz
strings1 = df100k['usernames']
A = process.cdist(strings1, strings1, scorer=fuzz.ratio)
Here is what it outputs:
[[100. 25. 26.666666 ... 28.571428 40. 12.5 ]
[ 25. 100. 11.764706 ... 25. 11.764706 33.333332]
[ 26.666666 11.764706 100. ... 26.666666 25. 11.764706]
...
[ 28.571428 25. 26.666666 ... 100. 40. 12.5 ]
[ 40. 11.764706 25. ... 40. 100. 11.764706]
[ 12.5 33.333332 11.764706 ... 12.5 11.764706 100. ]]
Wanted output:
[[('abc', 100.), ('def', 25.), ...], ...]
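To make the wanted output concrete, here is a minimal sketch of the pairing I have in mind. It assumes that row i and column j of the matrix correspond to the i-th and j-th input strings. The helper name pair_with_scores and its min_score parameter are hypothetical (not part of RapidFuzz), the usernames are placeholders, and the scores are just the top-left corner of the output printed above.

```python
def pair_with_scores(strings, scores, min_score=0.0):
    """Pair every score in a row with the string belonging to its column.

    `scores` can be any 2D structure that supports row iteration and
    integer indexing (a list of lists here; an ndarray should work too).
    Pairs scoring below `min_score` are dropped, so rows may differ in length.
    """
    strings = list(strings)
    return [
        [
            (strings[j], float(row[j]))
            for j in range(len(strings))
            if row[j] >= min_score
        ]
        for row in scores
    ]


# Placeholder usernames; the real ones are not shown in this post.
names = ["abc", "def", "ghi"]

# Top-left 3x3 corner of the matrix printed above, copied by hand.
scores = [
    [100.0, 25.0, 26.666666],
    [25.0, 100.0, 11.764706],
    [26.666666, 11.764706, 100.0],
]

pairs = pair_with_scores(names, scores)
print(pairs[0])

# With the real data this would presumably be called as:
#   pairs = pair_with_scores(strings1, A)
```

With around 400,000 rows the full matrix has about 1.6e11 entries, so building a tuple for every pair is probably not practical. That is why the sketch has a min_score filter, and a solution that only keeps pairs above some threshold would be fine for me.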
Thank you in advance!