1
votes

I am trying to validate an email address with a simple regular expression (not an RFC 822-compliant one).

I also need to capture the username, the sub-domain (if any), the domain, and the TLD suffix (com, net, ...). For this I've come up with the following regex:

/^([a-z0-9_\-\.]{6,})+@((?:[a-z0-9\.])*)([a-z0-9_\-]+)[\.]([a-z0-9]{2,})$/i

and for example the emails are:

username@domain.com
username@us.domain.com
username@au.domain.com
username@us.au.domain.com

and the regex should validate them all and capture all the groups.

So, I was wondering if the regex is correct or is there anything else I need to consider too?

3
Ahh, it's because I needed the username to be at least 6 characters. Sorry, I forgot to include that in my question. - bn00d

3 Answers

2
votes

n00p, I see that you had not yet found an expression to do exactly what you wanted, and that you said "may be someone will come up with better solution and post it here".

So here is a regex that does what you wanted. I have modified your own expression the least amount possible, assuming that you knew what you wanted.

To make it easy to read, the expression is in free-spacing mode. You use it like any other regex.

$regex = "~(?ix) # case-insensitive, free-spacing
^                # assert head of string
([a-z0-9_-]{6,24})    # capture username to Group 1
(?<=[0-9a-z])     # assert that the previous character was a digit or letter
@                 # literal
(                 # start group 2: whole domain
(?:[a-z0-9-]+\.)* # optional subdomain: don't capture
(                 #start group 3: domain
[a-z0-9_-]+       # the last word
\.                # the dot
([a-z]{2,})       # capture TLD to group 4
)                 # end group 3: domain
)                 # end group 2: whole domain
$                 # assert end of string
~";

This will capture username to Group 1, the whole domain to Group 2, domain to Group 3, and the TLD to Group 4.
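
For comparison, here is a quick sketch of what the expression from the question captures. The sample address is an assumption for illustration. Because the optional group in that expression accepts dots and is greedy, it takes most of the domain and leaves the following group a single character, which the nested groups above avoid.

```php
<?php
// The pattern from the question, unchanged.
$original = '/^([a-z0-9_\-\.]{6,})+@((?:[a-z0-9\.])*)([a-z0-9_\-]+)[\.]([a-z0-9]{2,})$/i';

// Sample address chosen for illustration only.
preg_match($original, 'username@us.domain.com', $m);

// Group 2 is greedy and its class includes the dot, so it gives back only
// as many characters as groups 3 and 4 strictly need in order to match.
print_r(array_slice($m, 1));
```

The four captures come out as username, us.domai, n and com, so the address validates but the domain is split in the wrong place.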

One small change you will see is that I have unescaped the - and . in the character classes, because there is no need to escape them there: a dot is literal inside a class, and so is a hyphen placed last. I did not replace the [a-z0-9_] expressions with \w, because if you ever switch to Unicode or a different locale, \w might match more than you expect.
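
A small sketch of both points, with made-up test strings. The last two calls assume PHP's u modifier, under which \w also accepts non-ASCII letters while the explicit class does not.

```php
<?php
// Inside [...] a dot is literal, and a hyphen placed last is literal too,
// so neither needs a backslash.
$class = '/^[a-z0-9_.-]+$/i';

$plain = preg_match($class, 'a.b-c_d');   // dot, hyphen, underscore accepted
$plus  = preg_match($class, 'a+b');       // plus is not in the class

// "us\u{e9}r" is "user" with an accented e (UTF-8 encoded).
$accented = "us\u{e9}r";
$wordU    = preg_match('/^\w+$/u', $accented);          // \w in Unicode mode
$explicit = preg_match('/^[a-z0-9_]+$/i', $accented);   // ASCII-only class

var_dump($plain, $plus, $wordU, $explicit);
```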

Here is the whole thing in use:

<?php
$emails = array("username@domain.com",
           "username@us.domain.com",
           "username@au.domain.com",
           "username@us.au.domain.com");

$regex = "~(?ix) # case-insensitive, free-spacing
^                # assert head of string
([a-z0-9_-]{6,24})    # capture username to Group 1
(?<=[0-9a-z])     # assert that the previous character was a digit or letter
@                 # literal
(                 # start group 2: whole domain
(?:[a-z0-9-]+\.)* # optional subdomain: don't capture
(                 #start group 3: domain
[a-z0-9_-]+       # the last word
\.                # the dot
([a-z]{2,})       # capture TLD to group 4
)                 # end group 3: domain
)                 # end group 2: whole domain
$                 # assert end of string
~";

echo "<pre>";
foreach($emails as $email) {
    if(preg_match($regex,$email,$match)) print_r($match);
}
echo "</pre>";
?>

And here is the output:

Array
(
    [0] => username@domain.com
    [1] => username
    [2] => domain.com
    [3] => domain.com
    [4] => com
)
Array
(
    [0] => username@us.domain.com
    [1] => username
    [2] => us.domain.com
    [3] => domain.com
    [4] => com
)
Array
(
    [0] => username@au.domain.com
    [1] => username
    [2] => au.domain.com
    [3] => domain.com
    [4] => com
)
Array
(
    [0] => username@us.au.domain.com
    [1] => username
    [2] => us.au.domain.com
    [3] => domain.com
    [4] => com
)
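
If you also want the sub-domain on its own, one option is to derive it from Groups 2 and 3 rather than adding another capture. This is only a sketch: split_email is a hypothetical helper name, and the regex is the one above collapsed onto a single line.

```php
<?php
// Hypothetical helper (not part of the original answer): names the captures
// and derives the sub-domain as whatever precedes group 3 inside group 2.
function split_email(string $email): ?array
{
    $regex = '~^([a-z0-9_-]{6,24})(?<=[0-9a-z])@((?:[a-z0-9-]+\.)*([a-z0-9_-]+\.([a-z]{2,})))$~i';
    if (!preg_match($regex, $email, $m)) {
        return null; // address did not validate
    }
    // Cut the domain (group 3) off the end of the whole domain (group 2).
    $sub = substr($m[2], 0, strlen($m[2]) - strlen($m[3]));
    return [
        'username'  => $m[1],
        'subdomain' => rtrim($sub, '.'), // '' when there is no sub-domain
        'domain'    => $m[3],
        'tld'       => $m[4],
    ];
}

print_r(split_email('username@us.au.domain.com'));
```

For the address in the example the sub-domain comes out as us.au, and for username@domain.com it is an empty string.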
1
votes

You'd most likely be better off using parse_url to get the parts and then validating each part separately.
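
A rough sketch of that idea, under one assumption: the address is prefixed with "//" so that parse_url() reads it as the authority part of a URL and returns the user and host separately. The per-part checks simply mirror the rules from the question and are not a complete validation.

```php
<?php
// Assumption: the leading "//" makes parse_url() treat the address as a
// scheme-relative URL, so "user" and "host" are reported separately.
$parts = parse_url('//username@us.domain.com');

$user = $parts['user'] ?? '';
$host = $parts['host'] ?? '';

// Split the host into labels: the last one is the TLD, the one before it
// the domain name, and anything earlier is the sub-domain.
$labels    = explode('.', $host);
$tld       = array_pop($labels);
$name      = array_pop($labels);
$subdomain = implode('.', $labels);

// Per-part checks mirroring the question's rules (username of 6+ characters,
// alphabetic TLD of 2+ letters, non-empty domain name).
$valid = preg_match('/^[a-z0-9_.-]{6,}$/i', $user) === 1
      && preg_match('/^[a-z]{2,}$/i', (string) $tld) === 1
      && $name !== null && $name !== '';

var_dump($user, $subdomain, $name, $tld, $valid);
```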

0
votes

I've tried for some time myself but I still don't have the most appropriate result that I have been trying to get, but this is the closest I've got so far:

^([a-z0-9_\-\.]{6,24})(?<=[0-9a-z])@((?:[a-z0-9][-\w]*[a-z0-9]*\.)+([a-z]{2,}))$

This will capture the username, the TLD suffix and the whole domain, and it will validate the email whether or not it belongs to a sub-domain. But I am still not able to extract just the domain name. I think I can live with that for now.

For an email such as username@domain.com it will validate and capture username, domain.com and com; for other emails like username@au.domain.com it will validate and capture username, au.domain.com and com.

This is not exactly what I wanted, and maybe someone will come up with a better solution and post it here.