1
votes

I have a regex that should match only alphanumeric characters, ".", and "_" both before and after the @ sign. It should match only the following TLDs:

com, org, edu, gov, uk, net, ca, de, jp, fr, au, us, ru, ch, it, nl, se, no, mil, biz, io, cc, co, info

For example, it should match sample22_test.tester.edu@auto.gmail.mil and test@gmail.com, but not anothertest.325-2352@yahoo.pys (contains a hyphen and a non-matching TLD) or tester1234@yahoo.neta (.net is a matching TLD, but .neta is not).

I have the following regex:

my $email_regex = qr/[a-zA-Z0-9._]+\@[a-zA-Z0-9._]+\.(com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|mil|biz|io|cc|co|info)/;

This matches correctly up to the appropriate TLD, but if the TLD is followed by additional alphanumeric characters, it still counts as a match (which it shouldn't); the output just omits the characters after the TLD.

input:

sample@gmail.com example@autotest.comcast.net<sender: apache.apache_testapache@apache.edu >
whoisthis@questions.gov,find@find.co{}Failure@pastattempts.frz;
sample2@yahoo.com
sample5@test.biz : test
sample92.sdfj@gmail.com
sample22_242@tech.org
greenjeans_93_who.ask@tester.info
computergeek324@ask.nets
anothertest.tester.gov@gmail.ch
helloooooow232@aol.com<;Senderfailure>
finaltest23_3test@yahoo.its

output (I have inserted comments to indicate what matched correctly and what shouldn't have matched but did anyway):

sample@gmail.com #correct
example@autotest.comcast.net #correct
apache.apache_testapache@apache.edu #should not match
whoisthis@questions.gov #correct
find@find.co #correct
Failure@pastattempts.fr #should not match
sample2@yahoo.com #correct
sample5@test.biz #correct
sample92.sdfj@gmail.com #correct
sample22_242@tech.org #correct 
greenjeans_93_who.ask@tester.info #correct
computergeek324@ask.net #should not match
anothertest.tester.gov@gmail.ch #correct
helloooooow232@aol.com #correct
finaltest23_3test@yahoo.it #should not match

EDIT: the input file contains many other characters after the email, such as <, >, :, ;, and ". These are okay: the address can still be matched, but those characters should not be included in the output, as seen above.

1
You need to set up the regex to match only when ...co|info is the last thing in the string. As it stands, the regex matches the given pattern even if the string has more after it. So you need to add the end-of-string anchor: ...co|info)$/ (or \Z) - zdim
... and you should also add the beginning-of-string anchor, to make sure that you are not accepting junk at the beginning, i.e. = qr/^. - Stefan Becker
@zdim this is still not matching the correct number of emails, take a look at my edit maybe that could help? - learningunix717
Ah, that's different: change the anchor to word-boundary, \b, as @Nick says in their answer, except that you may have to allow < instead. - zdim
Why should apache.apache_testapache@apache.edu not match? It seems to fulfill your requirements? - Stefan Becker

1 Answer

3
votes

Since you are trying to find these within a larger string, you need to define which characters are not considered part of an email address (I will assume any character that you have not specified as allowed) so that you can anchor the beginning and end of each match. A regex keeps trying every possibility until it finds a substring that matches, so without these constraints it will accept a piece of a longer token, such as ask.net out of ask.nets. One approach is to extract all possible runs of the characters you allow, then run a second regex (your original regex), anchored to the beginning and end with \A and \z, on each run to validate its format and the TLDs you want to allow.

Also note that since TLDs are not case sensitive, you probably want the /i regex modifier.

foreach my $email ($str =~ m/([a-zA-Z0-9._@]+)/g) {
    next unless $email =~ m/\A...\z/i;
}
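
As a rough, self-contained sketch of the two-step approach above: the sub name extract_emails and the sample string are made up for illustration, and the validation pattern is simply the regex from the question wrapped in \A and \z with /i added.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns every run of
# allowed characters in $str that also passes the anchored validation regex.
sub extract_emails {
    my ($str) = @_;

    # The question's regex, anchored at both ends and made case-insensitive.
    my $valid_re = qr/\A[a-zA-Z0-9._]+\@[a-zA-Z0-9._]+\.(?:com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|mil|biz|io|cc|co|info)\z/i;

    my @found;
    # Step 1: pull out every maximal run of allowed characters (plus @).
    foreach my $candidate ($str =~ m/([a-zA-Z0-9._\@]+)/g) {
        # Step 2: keep the run only if the whole run is a valid address.
        push @found, $candidate if $candidate =~ $valid_re;
    }
    return @found;
}

my $input = 'whoisthis@questions.gov,find@find.co{}Failure@pastattempts.frz; computergeek324@ask.nets';
print "$_\n" for extract_emails($input);
```

With that sample string this prints whoisthis@questions.gov and find@find.co. The .frz and .nets candidates are dropped because \z requires the TLD to be the last thing in the extracted run.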

Your regex is also woefully incomplete; email addresses are complex. (If you want to see what a complete email address parsing regex looks like, check out Email::Valid.) If you want to allow more valid email addresses and are flexible in your approach, I recommend using Email::Address::XS to parse them.

use strict;
use warnings;
use Email::Address::XS;

my $tld_re = qr/\.(com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|mil|biz|io|cc|co|info)\z/i;

my $email = 'test@gmail.com'; # example input; in practice, each candidate string

my $address = Email::Address::XS->parse_bare_address($email);

if ($address->is_valid and $address->host =~ m/$tld_re/) {
   # matches
}
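
For comparison, the boundary idea raised in the comments can also be written as a single pattern. This is only a sketch under the same assumption as above (any character outside the allowed set ends an address). It uses lookarounds instead of \b, because \b would still see a boundary before a "." or "@". The names find_emails_single_pass and $tlds are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical single-pass variant (not from the original post).
sub find_emails_single_pass {
    my ($str) = @_;

    my $tlds = join '|', qw(com org edu gov uk net ca de jp fr au us ru ch
                            it nl se no mil biz io cc co info);

    my $single_re = qr/
        (?<![a-zA-Z0-9._\@])    # not preceded by an allowed character
        [a-zA-Z0-9._]+ \@ [a-zA-Z0-9._]+
        \. (?:$tlds)
        (?![a-zA-Z0-9._\@])     # not followed by an allowed character
    /xi;

    # With no capture groups, a list-context match with the g modifier
    # returns each whole match.
    return $str =~ m/$single_re/g;
}

my $input = 'a.b@x.org, c@y.nets d_1@z.info>';
print "$_\n" for find_emails_single_pass($input);
```

With that sample string this prints a.b@x.org and d_1@z.info. The lookahead rejects c@y.nets because "net" is followed by another allowed character.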