How to skip a non-matching line in the input when replacing by regex?

For example, below are the contents of my test.txt:

[email protected]
[email protected]
elke engineering ltd.,@yahoo.com
[email protected]
[email protected]

Below is my AutoHotkey script with the regex code:

ReplaceEmailsRegEx := "i)([a-z0-9]+(\.*|\_*|\-*))+@([a-z][a-z0-9\-]+(\.|\-*\.))+[a-z]{2,6}"
RemoveDuplicateCharactersRegEx := "s)(.)(?=.*\1)"

Try{
    FileRead, EmailFromTxtFile, test.txt
    OtherThanEmails := RegExReplace(EmailFromTxtFile, ReplaceEmailsRegEx)
    Chars := RegExReplace(OtherThanEmails, RemoveDuplicateCharactersRegEx)
    Loop{
        StringReplace, OtherThanEmails, OtherThanEmails, `r`n`r`n,`r`n, UseErrorLevel
        If ErrorLevel = 0
            Break
    }
    If (StrLen(OtherThanEmails)){
        Msgbox The Characters found other than email:`n%OtherThanEmails%
    }
}
catch e {
    ErrorString := "what: " . e.what . " file: " . e.file . " line: " . e.line . " msg: " . e.message . " extra: " . e.extra
    Msgbox An Exception was thrown`n%ErrorString%
}
Return

When it runs the replacement on test.txt, it throws an error:

e.what contains 'RegExReplace', e.line is 10

It executes without error when I remove the 3rd entry from test.txt. So how can I change my regex to skip the problematic string?

On the error, it aborts processing of the whole file, so the remaining valid email matches are skipped. - Dhay
For the person who downvoted: May I know the reason so that I can improve my next posts to be useful. - Dhay
You have a classic case of catastrophic backtracking in your regex. Where did you get this pattern from? Please try i)[a-z0-9]+(?:(?:\.+|_+|-+)[a-z0-9]+)*@([a-z][-a-z0-9]+\.)+[a-z]{2,6}. Or i)[a-z0-9]+(?:([._-])\1*[a-z0-9]+)*@([a-z][-a-z0-9]+\.)+[a-z]{2,6} - Wiktor Stribiżew
@WiktorStribiżew it worked. Yours is the answer. The catastrophically backtracking regex was created by me. That's why it worked like a charm. lol - Dhay
I think the downvote is due to the question itself - matching emails is so common a task that you can easily find a better regex for this by just searching SO via Google (I find Google search better than SO built-in one). - Wiktor Stribiżew

1 Answer

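The problem is catastrophic backtracking caused by the nested quantifiers at the beginning: ([a-z0-9]+(\.*|\_*|\-*))+. Here, the ., _ and - are all optional because of the * quantifier, so the inner group can match an empty string and the pattern effectively reduces to ([a-z0-9]+)+. When no @ follows a run of letters and digits, the engine tries every way of splitting that run between the inner and outer + before giving up, and the number of attempts grows exponentially with the length of the run.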
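To see the failure in isolation, here is a minimal sketch (AutoHotkey v1 syntax assumed) that runs the original pattern against the problematic line from test.txt outside of a Try block. The variable names are made up for illustration. Outside Try, the RegEx functions report failures through ErrorLevel instead of throwing, and a negative value there means PCRE gave up during execution rather than simply finding no match.

```autohotkey
; Sketch only: Subject, NestedRegEx and NestedError are illustrative names.
Subject := "elke engineering ltd.,@yahoo.com"   ; the 3rd line of test.txt
NestedRegEx := "i)([a-z0-9]+(\.*|\_*|\-*))+@([a-z][a-z0-9\-]+(\.|\-*\.))+[a-z]{2,6}"

FoundPos := RegExMatch(Subject, NestedRegEx)
NestedError := ErrorLevel   ; save it before another command overwrites it

; 0 means the pattern ran to completion (match or no match);
; a negative number is a PCRE execution error such as a reached match limit.
MsgBox % "FoundPos: " . FoundPos . "`nErrorLevel: " . NestedError
Return
```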

I suggest "unrolling" the first subpattern to make it linear:

i)[a-z0-9]+(?:(?:\.+|_+|-+)[a-z0-9]+)*@([a-z][-a-z0-9]+\.)+[a-z]{2,6}

Or

i)[a-z0-9]+(?:([._-])\1*[a-z0-9]+)*@(?:[a-z][-a-z0-9]+\.)+[a-z]{2,6}

You may even remove \1* if you do not allow more than one ., _ or - in a row between "words".

Also, there is no need to use \-* in the alternation (\.|\-*\.): a hyphen is already matched by the preceding character class [a-z0-9\-]+, so this subpattern can be reduced to a plain \. (an escaped dot).
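As a rough sketch of how the first unrolled pattern could be plugged into the question's flow (AutoHotkey v1 syntax assumed), the script below processes test.txt one line at a time and checks ErrorLevel after each replacement, so a single troublesome line cannot abort the whole run. The line-by-line loop and the names EmailRegEx, Rest and Leftovers are additions made for illustration, not part of the original script.

```autohotkey
; Sketch only: the unrolled pattern from above, applied per line.
EmailRegEx := "i)[a-z0-9]+(?:(?:\.+|_+|-+)[a-z0-9]+)*@([a-z][-a-z0-9]+\.)+[a-z]{2,6}"

FileRead, EmailFromTxtFile, test.txt
Leftovers := ""
Loop, Parse, EmailFromTxtFile, `n, `r
{
    ; Strip every email-like match from the current line.
    Rest := RegExReplace(A_LoopField, EmailRegEx)
    If (ErrorLevel != 0)        ; regex error on this line: keep the whole line for review
        Rest := A_LoopField
    If (Trim(Rest) != "")       ; collect only lines that still have leftover text
        Leftovers .= Rest . "`n"
}
If (Leftovers != "")
    MsgBox The characters found other than email:`n%Leftovers%
Return
```

Splitting on lines also removes the need for the loop that collapses repeated `r`n pairs, since empty lines are simply never collected.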
