fuzzy regex matching in julia

Question

Is there a way to do fuzzy regex matching in Julia?

I have constructed the following regular expression test:

toMatch = Regex(word,"i")
ismatch(toMatch,input_string)

I would like to be able to do this test but allow for some latitude in the matching and to specify this by Levenshtein distance.

I have found the package Levenshtein which can calculate the distance but am not sure how to incorporate it into this logic. For example:

levenshtein("hello","hllo")
> 1

Do you need regex here? This sounds like a hard (computationally) problem for general regular expressions. — Fengyang Wang
It's possible that I don't need it. I first solved this problem for exact matches using the code listed here and now am trying to allow accepting of misspellings within input_string. — Aaron

Fengyang Wang · Accepted Answer · 2016-06-21T23:56:53

(This answer has nothing to do with regular expressions, but it covers some use cases.)

I don't know if this works for your use case. But it looks like you are trying to find whether a word (or a close misspelling) is in your text. If the text is separated by spaces, and your word does not contain spaces, you could try something like:

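using Unicode       # on Julia 0.7 and later, normalize_string is Unicode.normalize
using Levenshtein   # provides levenshtein(a, b)

# Drop punctuation characters.
nopunct(s) = filter(c -> !ispunct(c), s)
# Fold case, strip accents, and apply Unicode compatibility decomposition.
nfcl(s) = Unicode.normalize(s, decompose=true, compat=true, casefold=true,
                            stripmark=true, stripignore=true)
canonicalize(s) = nopunct(nfcl(s))
# True if any whitespace-separated word is fewer than n edits from the needle.
fuzzy(needle, haystack, n) = any(
    w -> levenshtein(w, canonicalize(needle)) < n,
    split(canonicalize(haystack)))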

What this does is, roughly:

nfcl normalizes strings with similar "human" appearances, by stripping out accents, ignoring case, and performing Unicode normalization. This is pretty useful for fuzzy matching:

julia> nfcl("Ce texte est en français.")
"ce texte est en francais."

nopunct strips punctuation characters, further simplifying the string.

julia> nopunct("Hello, World!")
"Hello World"

canonicalize simply combines these two transformations.

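Then we check whether any of the words in the haystack (split by whitespace) is close to the needle in Levenshtein distance. Note that the comparison is strict (distance < n), so n = 2 accepts only words at distance 0 or 1.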
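To make the distance test concrete, here is a minimal sketch of the edit distance itself, written in plain Julia so it runs without any package. The names edit_distance and within are hypothetical helpers introduced only for illustration; they are not part of the Levenshtein package, whose implementation may differ.

```julia
# Hypothetical helper (not part of the Levenshtein package): a plain
# dynamic-programming edit distance over characters.
function edit_distance(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)      # index by character, not by byte
    prev = collect(0:length(t))        # distances from the empty prefix of s
    for i in 1:length(s)
        curr = similar(prev)
        curr[1] = i                    # delete the first i characters of s
        for j in 1:length(t)
            cost = s[i] == t[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev = curr
    end
    return prev[end]
end

# Same strict comparison as in fuzzy: a word matches only if distance < n.
within(word, needle, n) = edit_distance(word, needle) < n

println(edit_distance("hello", "hllo"))    # 1
println(within("robrt", "robert", 2))      # true
println(within("john", "robert", 2))       # false
```

If you would rather treat n as an inclusive bound, changing < to <= in within (or in fuzzy) is the only adjustment needed.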

Examples:

julia> fuzzy("Robert", "My name is robrt.", 2)
true

julia> fuzzy("Robert", "My name is john.", 2)
false

This is by no means a complete solution, but it covers a lot of common use cases. For more advanced use cases, you should look into the subject in more depth.
