1
votes

I have a regex that should match only alphanumeric characters, ".", and "_" both before and after the @ sign. It should match only the following TLDs:

com, org, edu, gov, uk, net, ca, de, jp, fr, au, us, ru, ch, it, nl, se, no, mil, biz, io, cc, co, info

For example, it should match [email protected] and [email protected], but not [email protected] (contains a hyphen and a non-matching TLD) or [email protected] (.net is a matching TLD, but .neta is not).

I have the following regex:

my $email_regex = qr/[a-zA-Z0-9._]+\@[a-zA-Z0-9._]+\.(com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|mil|biz|io|cc|co|info)/;

This matches correctly up to the appropriate TLD, but if the TLD is followed by additional alphanumeric characters, it still counts as a match (which it shouldn't); the output just omits the characters after the TLD.
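A minimal reproduction of that behavior, as a sketch: the address `user@example.neta` is made up (it stands in for the real input), and the helper name `first_match` is only for illustration.

```perl
use strict;
use warnings;

# Regex exactly as posted above.
my $email_regex = qr/[a-zA-Z0-9._]+\@[a-zA-Z0-9._]+\.(com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|mil|biz|io|cc|co|info)/;

# Hypothetical helper: return the first substring the regex matches, or undef.
sub first_match {
    my ($line) = @_;
    return $line =~ /($email_regex)/ ? $1 : undef;
}

# ".neta" is not on the TLD list, yet the regex still matches,
# stopping right after ".net" and leaving the trailing "a" out.
print first_match('user@example.neta'), "\n";   # prints user@example.net
```

Nothing in the pattern says what may follow the TLD, so matching a prefix of a longer TLD satisfies it.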

input:

[email protected] [email protected]<sender: [email protected] >
[email protected],[email protected]{}[email protected];
[email protected]
[email protected] : test
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]<;Senderfailure>
[email protected]

output (I have inserted comments to indicate what matched correctly and what shouldn't have matched but did anyways):

[email protected] #correct
[email protected] #correct
[email protected] #should not match
[email protected] #correct
[email protected] #correct
[email protected] #should not match
[email protected] #correct
[email protected] #correct
[email protected] #correct
[email protected] #correct 
[email protected] #correct
[email protected] #should not match
[email protected] #correct
[email protected] #correct
[email protected] #should not match

EDIT: The input file contains many other characters after the email, such as <, >, :, ;, and ". These are okay: the email can still be matched, but those characters should not be included in the output, as seen above.

1
You need to set up the regex to match only when ...co|info is the last thing in the string; as it is now, the regex matches the given pattern but doesn't care if the string has more after it. So you need to add the end-of-string anchor: ...co|info)$/ (or \Z). - zdim
... and you should also add the beginning-of-string anchor, to make sure that you are not accepting junk at the beginning, i.e. = qr/^. - Stefan Becker
@zdim this is still not matching the correct number of emails. Take a look at my edit, maybe that could help? - learningunix717
Ah, that's different: change the anchor to word-boundary, \b, as @Nick says in their answer, except that you may have to allow < instead. - zdim
Why should [email protected] not match? It seems to fulfill your requirements? - Stefan Becker

1 Answer

3
votes

Since you are trying to find these within a larger string, you need to define what characters would not be considered part of the email address (I will assume any characters that you have not specified as allowed) so that you can anchor the beginning and ending of each match. The regex engine is satisfied by any substring that fits the pattern, so unless you define these constraints, a match can start or stop in the middle of a longer run of allowed characters, which is why a prefix of a longer TLD is accepted. One approach is to extract all possible strings of characters you allow, then run a second regex (your original regex), anchored to the beginning and end with \A and \z, to validate each string's format and the TLDs you want to allow.

Also note that since TLDs are not case sensitive, you probably want the /i regex modifier.

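# Pass 1: pull out every run of allowed characters (including @) as a candidate.
foreach my $email ($str =~ m/([a-zA-Z0-9._@]+)/g) {
    # Pass 2: "..." stands for your original pattern, now anchored at both ends.
    next unless $email =~ m/\A...\z/i;
    # $email passed validation here
}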
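For reference, here is a self-contained sketch of the two-pass approach above. The TLD list is copied from the question; the sub name `extract_emails`, the variable names, and the sample addresses are made up for illustration and are not the data from the question.

```perl
use strict;
use warnings;

# TLD list copied from the question; everything else here is illustrative.
my $tlds = 'com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|mil|biz|io|cc|co|info';

# Anchored validator: the whole candidate must be local@domain.tld and nothing more.
my $valid_re = qr/\A[a-z0-9._]+\@[a-z0-9._]+\.(?:$tlds)\z/i;

# Hypothetical helper: pass 1 pulls out maximal runs of allowed characters,
# pass 2 keeps only the runs that validate in full.
sub extract_emails {
    my ($str) = @_;
    return grep { $_ =~ $valid_re } $str =~ m/([a-zA-Z0-9._\@]+)/g;
}

# Made-up sample input with the kinds of separators mentioned in the question.
my $sample = 'alice@example.com,bob@mail.example.org<sender: carol@example.neta > dave@EXAMPLE.NET;';
print "$_\n" for extract_emails($sample);
```

With this sample it prints alice@example.com, bob@mail.example.org, and dave@EXAMPLE.NET; carol@example.neta is dropped because the anchored check fails on the TLD. One caveat with this sketch: a hyphen is not in the pass-1 character class, so it acts as a separator, and the part after the hyphen can still validate on its own. If such addresses should be rejected entirely, add the hyphen to the pass-1 class (but not to the validator).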

Your regex is also woefully incomplete; email addresses are complex. (If you want to see what a complete email address parsing regex looks like, check out Email::Valid.) If you want to allow more valid email addresses and are flexible in your approach, I recommend using Email::Address::XS to parse them.

use strict;
use warnings;
use Email::Address::XS;

my $tld_re = qr/\.(com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|mil|biz|io|cc|co|info)\z/i;

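# Candidate address to check; taken from the command line here only as an example.
my $email = shift(@ARGV) // '';

my $address = Email::Address::XS->parse_bare_address($email);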

if ($address->is_valid and $address->host =~ m/$tld_re/) {
   # matches
}