(This answer has nothing to do with regular expressions, but it covers some use cases.)
I don't know whether this fits your use case, but it looks like you are trying to find whether a word (or a close misspelling of it) occurs in your text. If the text is separated by whitespace and your word contains no spaces, you could try something like:
using Unicode  # on Julia ≥ 0.7, `normalize_string` is `Unicode.normalize`

# Strip punctuation characters.
nopunct(s) = filter(c -> !ispunct(c), s)
# Case-fold, strip accents/marks and ignorable characters, and apply
# Unicode compatibility normalization.
nfcl(s) = Unicode.normalize(s, decompose=true, compat=true, casefold=true,
                            stripmark=true, stripignore=true)
canonicalize(s) = nopunct(nfcl(s))
# `levenshtein` is assumed to come from a package, e.g. StringDistances.jl
# (there it is spelled `Levenshtein()(a, b)`).
fuzzy(needle, haystack, n) = any(
    w -> levenshtein(w, canonicalize(needle)) < n,
    split(canonicalize(haystack)))
Roughly, what each piece does:
nfcl
normalizes strings that look the same to a human reader: it case-folds, strips accents and other marks, drops ignorable characters, and applies Unicode compatibility normalization. This is pretty useful for fuzzy matching:
julia> nfcl("Ce texte est en français.")
"ce texte est en francais."
nopunct
strips punctuation characters, further simplifying the string.
julia> nopunct("Hello, World!")
"Hello World"
canonicalize
simply combines these two transformations.
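For completeness, here is the combined transformation on a string that needs both steps (using `Unicode.normalize`, the Julia ≥ 0.7 name for `normalize_string`):

```julia
using Unicode

nopunct(s) = filter(c -> !ispunct(c), s)
nfcl(s) = Unicode.normalize(s, decompose=true, compat=true, casefold=true,
                            stripmark=true, stripignore=true)
canonicalize(s) = nopunct(nfcl(s))

println(canonicalize("Ce texte, en FRANÇAIS, est ponctué!"))
# → ce texte en francais est ponctue
```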
Then we check whether any of the words in the haystack (split on whitespace) is within Levenshtein distance n of the canonicalized needle.
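To make that step concrete, here is a self-contained sketch with a minimal Levenshtein distance standing in for whatever package-provided `levenshtein` you use:

```julia
# Minimal Levenshtein edit distance (two-row dynamic programming).
function levenshtein(a::AbstractString, b::AbstractString)
    bs = collect(b)
    prev = collect(0:length(bs))
    for (i, ca) in enumerate(a)
        curr = similar(prev)
        curr[1] = i
        for (j, cb) in enumerate(bs)
            curr[j + 1] = min(prev[j + 1] + 1,               # deletion
                              curr[j] + 1,                   # insertion
                              prev[j] + (ca == cb ? 0 : 1))  # substitution
        end
        prev = curr
    end
    return prev[end]
end

# "my name is robrt" is canonicalize("My name is robrt.")
words = split("my name is robrt")
println(levenshtein("robrt", "robert"))                 # → 1
println(any(w -> levenshtein(w, "robert") < 2, words))  # → true
```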
Examples:
julia> fuzzy("Robert", "My name is robrt.", 2)
true
julia> fuzzy("Robert", "My name is john.", 2)
false
This is by no means a complete solution, but it covers a lot of common use cases. For anything more advanced, look into dedicated string-distance and fuzzy-matching libraries, which handle tokenization, alternative distance metrics, and performance much more thoroughly.