0
votes

I'm trying to find all occurrences of a word in paragraph and I want it to account for spelling mistakes as well. Code:

to_search="caterpillar"
search_here= "caterpillar are awesome animal catterpillar who like other humans but not other caterpilar"
#search_here has the word caterpillar repeated but with spelling mistakes

s= SequenceMatcher(None, to_search, search_here).get_matching_blocks()
print(s)

#Output  : [Match(a=0, b=0, size=11), Match(a=3, b=69, size=0)] 
#Expected: [Match(a=0, b=0, size=11), Match(a=0, b=32, size=11), Match(a=0, b=81, size=11)]

Difflib get_matching_blocks only detects the first instance of "caterpillar" in the search_here string. I want it to give me output of all closely matching blocks i.e. it should identify "caterpillar","catterpillar" and "caterpilar"

How can I solve this problem?

1
When you look for the difference in texts using diffing you don’t expect to find ALL possible differences between the two texts. It will give you one (1) estimate on how different the strings are and how much has to change in either of the inputs to get the other. You are using the wrong tool for the job.NewPythonUser

1 Answers

0
votes

You could calculate the edit distance of each word vs. to_search. Then, you can select all the words that have "low enough" edit distance (a score of 0 means exact match).

Thanks to your question, I have discovered that there is a pip-install-able edit_distance Python module. Here are a couple examples I just tried out for the first time:

>>> edit_distance.SequenceMatcher('fabulous', 'fibulous').ratio()
0.875
>>> edit_distance.SequenceMatcher('fabulous', 'wonderful').ratio()
0.11764705882352941
>>> edit_distance.SequenceMatcher('fabulous', 'fabulous').ratio()
1.0
>>> edit_distance.SequenceMatcher('fabulous', '').ratio()
0.0
>>> edit_distance.SequenceMatcher('caterpillar', 'caterpilar').ratio()
0.9523809523809523

So, it looks like the ratio method gives you a number between 0 and 1 (inclusive) where 1 is an exact match and 0 is... not even in the same league XD. So yeah, you could select words that have a ratio greater than 1 - epsilon, where epsilon is maybe 0.1 or thereabouts.