62
votes

I recently read somewhere that writing a regexp to match an email address, taking into account all the variations and possibilities of the standard is extremely hard and is significantly more complicated than what one would initially assume.

Why is that?

Are there any known and proven regexps that actually do this fully?

What are some good alternatives to using regexps for matching email addresses?

19
Something interesting about email regular expressions: codinghorror.com/blog/archives/000214.html - Nikhil Kashyap
If you're just interested in matching common email patterns, you can have a look at some of the expressions here. - On Freund
I think what you read pertains not to "validating an e-mail address according to the standard", but rather "validating an actual e-mail address". The difference is not subtle, even if the wording is. Currently, the answers below are a mix of the two. Perhaps you would clarify the question? - bzlm
It is a common idiocy to parse complex text with a SINGLE regexp. But it is easy to parse complex text (such as C source code) with a SET of regexps, e.g. using lex and yacc. This method also does support recursion. Blame Larry. :) - Sam Watkins

19 Answers

64
votes

For the formal e-mail spec, yes: it is technically impossible with a regex, because of the recursion in things like comments (especially if you don't first replace comments with whitespace) and the various different formats (an e-mail address isn't always of the simple name@domain form). You can get close (with some massive and incomprehensible regex patterns), but a far better way of checking an e-mail address is the very familiar handshake:

  • they tell you their e-mail
  • you e-mail them a confirmation link with a GUID
  • when they click on the link you know that:

    1. the e-mail is correct
    2. it exists
    3. they own it

Far better than blindly accepting an e-mail address.
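
The handshake above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `EmailConfirmations`, the in-memory dictionary, and the `https://example.com/confirm` URL are all hypothetical, and actually sending the message is left out.

```python
import uuid


class EmailConfirmations:
    """In-memory store of pending confirmations (illustrative only)."""

    def __init__(self):
        self._pending = {}      # token -> address awaiting confirmation
        self.confirmed = set()  # addresses whose owner followed the link

    def start(self, address: str) -> str:
        """Create a one-time token and return the link to e-mail to the user."""
        token = uuid.uuid4().hex  # random GUID, hard to guess
        self._pending[token] = address
        # Hypothetical URL; a real application would use its own route.
        return f"https://example.com/confirm?token={token}"

    def finish(self, token: str) -> bool:
        """Confirm the address if the token is known; tokens are single-use."""
        address = self._pending.pop(token, None)
        if address is None:
            return False
        self.confirmed.add(address)
        return True
```

A real system would presumably also expire old tokens and persist them somewhere more durable than a dictionary.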

22
votes

There are a number of Perl modules (for example) that do this. Don't try and write your own regexp to do it. Look at

Mail::VRFY will do syntax and network checks (does an SMTP server somewhere accept this address?)

https://metacpan.org/pod/Mail::VRFY

RFC::RFC822::Address - a recursive descent email address parser.

https://metacpan.org/pod/RFC::RFC822::Address

Mail::RFC822::Address - regexp-based address validation, worth looking at just for the insane regexp

http://ex-parrot.com/~pdw/Mail-RFC822-Address.html

Similar tools exist for other languages. Insane regexp below...

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
 \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
 \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
 \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)
11
votes

Validating e-mail addresses isn't really very helpful anyway. It will not catch common typos or made-up email addresses, since these tend to look syntactically like valid addresses.

If you want to be sure an address is valid, you have no choice but to send a confirmation mail.

If you just want to be sure that the user inputs something that looks like an email rather than just "asdf", then check for an @. More complex validation does not really provide any benefit.

(I know this doesn't answer your questions, but I think it's worth mentioning anyway)
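
As a rough sketch of that minimal check in Python: the function name `looks_like_email` is made up for illustration, and requiring at least one character on each side of the '@' is a small assumption beyond "check for an @".

```python
def looks_like_email(text: str) -> bool:
    """Return True if text has an '@' with something on both sides of it."""
    at = text.find("@")  # index of the first '@', or -1 if there is none
    return 0 < at < len(text) - 1
```

This rejects "asdf", "@" and "user@", and lets everything else through to the confirmation mail.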

8
votes

I've now collated test cases from Cal Henderson, Dave Child, Phil Haack, Doug Lovell and RFC 3696. 158 test addresses in all.

I ran all these tests against all the validators I could find. The comparison is here: http://www.dominicsayers.com/isemail

I'll try to keep this page up-to-date as people enhance their validators. Thanks to Cal, Dave and Phil for their help and co-operation in compiling these tests and constructive criticism of my own validator.

People should be aware of the errata against RFC 3696 in particular. Three of the canonical examples are in fact invalid addresses. And the maximum length of an address is 254 or 256 characters, not 320.

8
votes

There is a context-free grammar in BNF that describes valid email addresses in RFC 2822. It is complex. For example:

" @ "@example.com

is a valid email address. I don't know of any regexps that do it fully; the examples usually given require comments to be stripped first. I wrote a recursive descent parser to do it fully once.
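
To illustrate why comments call for something more than a plain regex, here is a small sketch of comment stripping with a nesting counter. It is not the parser mentioned above: the function name `strip_comments` is hypothetical, it simply deletes comments rather than replacing them with whitespace, and it handles only parenthesised comments, quoted strings and backslash escapes.

```python
def strip_comments(address: str) -> str:
    """Remove parenthesised comments, which may nest, from an address."""
    out = []
    depth = 0          # current comment nesting level
    in_quotes = False  # inside a quoted string, parentheses are literal
    i = 0
    while i < len(address):
        ch = address[i]
        if ch == "\\" and i + 1 < len(address):
            # A backslash escapes the next character; keep the pair
            # only when we are outside a comment.
            if depth == 0:
                out.append(address[i:i + 2])
            i += 2
            continue
        if depth == 0 and ch == '"':
            in_quotes = not in_quotes
            out.append(ch)
        elif not in_quotes and ch == "(":
            depth += 1
        elif not in_quotes and ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
        i += 1
    if depth != 0:
        raise ValueError("unbalanced comment parentheses")
    return "".join(out)
```

The `depth` counter is the part a classic regular expression cannot express: it has to remember how many comments are currently open.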

7
votes

It's not all nonsense, though: allowing characters such as '+' can be highly useful for users combating spam, e.g. adding a +tag to the local part of a Gmail address (instant disposable Gmail addresses).

Only when a site accepts it though.

6
votes

Whether or not to accept bizarre, uncommon email address formats depends, in my opinion, on what one wants to do with them.

If you're writing a mail server, you have to be very exact and excruciatingly correct in what you accept. The "insane" regex quoted above is therefore appropriate.

For the rest of us, though, we're mainly just interested in ensuring that something a user types in a web form looks reasonable and doesn't have some sort of SQL injection or buffer overflow in it.

Frankly, does anyone really care about letting someone enter a 200-character email address with comments, newlines, quotes, spaces, parentheses, or other gibberish when signing up for a mailing list, newsletter, or web site? The proper response to such clowns is "Come back later when you have an address that looks like name@domain.tld".

The validation I do consists of ensuring that there is exactly one '@'; that there are no spaces, nulls or newlines; that the part to the right of the '@' has at least one dot (but not two dots in a row); and that there are no quotes, parentheses, commas, colons, exclamations, semicolons, or backslashes, all of which are more likely to be attempts at hackery than parts of an actual email address.
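
The rules in the previous paragraph translate fairly directly into code. The following Python sketch is one possible reading of them, not the author's actual implementation: the name `looks_reasonable` is made up, and treating both single and double quotes as "quotes" is an assumption.

```python
# Characters the rules above reject outright. Treating both quote styles
# as forbidden is an assumption; the text only says "quotes".
FORBIDDEN_CHARS = set(" \0\r\n\"'(),:;!\\")


def looks_reasonable(address: str) -> bool:
    """Apply the deliberately strict web-form checks described above."""
    if address.count("@") != 1:
        return False  # exactly one '@'
    if any(ch in FORBIDDEN_CHARS for ch in address):
        return False  # no spaces, nulls, newlines or suspicious punctuation
    domain = address.split("@")[1]
    if "." not in domain or ".." in domain:
        return False  # at least one dot, but never two in a row
    return True
```

Note that a plus-tagged address such as user+tag@example.com still passes, since '+' is not on the forbidden list.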

Yes, this means I'm rejecting valid addresses with which someone might try to register on my web sites - perhaps I "incorrectly" reject as many as 0.001% of real-world addresses! I can live with that.

4
votes

Quoting and various other rarely used but valid parts of the RFC make it hard. I don't know enough about this topic to comment definitively, other than "it's hard" - but fortunately other people have written about it at length.

As to a valid regex for it, the Perl Mail::RFC822::Address module contains a regular expression which will apparently work - but only if any comments have been replaced by whitespace already. (Comments in an email address? You see why it's harder than one might expect...)

Of course, the simplified regexes which abound elsewhere will validate almost every email address which is genuinely being used...

3
votes

Some flavours of regex can actually match nested brackets (e.g., Perl compatible ones). That said, I have seen a regex that claims to correctly match RFC 822 and it was two pages of text without any whitespace. Therefore, the best way to detect a valid email address is to send email to it and see if it works.

3
votes

Just to add a regex that is less crazy than the one listed by @mmaibaum:

^[a-zA-Z]([.]?([a-zA-Z0-9_-]+)*)?@([a-zA-Z0-9\-_]+\.)+[a-zA-Z]{2,4}$ 

It is not bulletproof, and certainly does not cover the entire email spec, but it does do a decent job of covering most basic requirements. Even better, it's somewhat comprehensible, and can be edited.
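
To see where "not bulletproof" bites, the pattern can be tried against a few short addresses. This is a quick Python sketch; the names `SIMPLE_EMAIL` and `matches` are just for illustration, and the sample addresses are invented.

```python
import re

# The pattern quoted above, unchanged.
SIMPLE_EMAIL = re.compile(
    r"^[a-zA-Z]([.]?([a-zA-Z0-9_-]+)*)?@([a-zA-Z0-9\-_]+\.)+[a-zA-Z]{2,4}$"
)


def matches(address: str) -> bool:
    """Return True if the whole address fits the simplified pattern."""
    return SIMPLE_EMAIL.match(address) is not None


for sample in (
    "jsmith@example.com",      # accepted
    "j.smith@example.com",     # accepted: a dot right after the first letter
    "john.smith@example.com",  # rejected: a dot later in the local part
    "user+tag@example.com",    # rejected: '+' is not in the character class
    "user@example.museum",     # rejected: top-level domain longer than 4 letters
):
    print(sample, matches(sample))
```

The nested quantifier `([a-zA-Z0-9_-]+)*` can also backtrack heavily on long inputs that do not match, so it is probably wise to cap the input length before applying a pattern like this.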

Cribbed from a discussion at HouseOfFusion.com, a world-class ColdFusion resource.

3
votes

An easy and good way to check email addresses in Java is to use the EmailValidator of the Apache Commons Validator library.

I would always check an email-address in an input-form against something like this before sending an email - even if you only catch some typos. You probably don't want to write an automated scanner for "delivery failed" notification mails. :-)

2
votes

It's really hard because there are a lot of things that can be valid in an email address according to the email spec, RFC 2822. Things that you don't normally see, such as +, are perfectly valid characters in an email address, according to the spec.

There's an entire section devoted to email addresses at http://regexlib.com, which is a great resource. I'd suggest that you determine which criteria matter to you and find a pattern that matches them. Most people really don't need full support for all possibilities allowed by the spec.

2
votes

If you're running on the .NET Framework, just try instantiating a MailAddress object and catching the FormatException if it blows up, or pulling out the Address if it succeeds. Without getting into any nonsense about the performance of catching exceptions (really, if this is just on a single Web form it is not going to make that much of a difference), the MailAddress class in the .NET Framework goes through a quite complete parsing process (it doesn't use a regex). Open up Reflector and search for MailAddress and MailBnfHelper.ReadMailAddress() to see all of the fancy stuff it does. Someone smarter than me spent a lot of time building that parser at Microsoft; I'm going to use it when I actually send an e-mail to that address, so I might as well use it to validate the incoming address, too.

1
votes

Many have tried, and many come close. You may want to read the Wikipedia article, and some others.

Specifically, you'll want to remember that many websites and email servers have relaxed validation of email addresses, so essentially they don't implement the standard fully. It's good enough for email to work all the time though.

1
votes

Try this one:

"(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"

Have a look here for the details.

However, rather than implementing the RFC822 standard, maybe it would be better to look at it from another viewpoint. It doesn't really matter what the standard says if mail servers don't mirror the standard. So I would argue that it would be better to imitate what the most popular mail servers do when validating email addresses.

1
votes

This class for Java has a validator in it: http://www.leshazlewood.com/?p=23

This is written by the creator of Shiro (formerly Ki, formerly JSecurity).

The pros and cons of testing for e-mail address validity:

There are two types of regexes that validate e-mails:

  1. Ones that are too loose.
  2. Ones that are too strict.

It is not possible for a regular expression to match all valid e-mail addresses and no e-mail addresses that are not valid because some strings might look like valid e-mail addresses but do not actually go to anyone's inbox. The only way to test to see if an e-mail is actually valid is to send an e-mail to that address and see if you get some sort of response. With that in mind, regexes that are too strict at matching e-mails don't really seem to have much of a purpose.

I think that most people who ask for an e-mail regex are looking for the first option, regexes that are too loose. They want to test a string and see if it looks like an e-mail address. If it is definitely not one, they can say to the user: "Hey, you are supposed to put an e-mail here and this definitely is not a valid e-mail. Perhaps you didn't realize that this field is for an e-mail or maybe there is a typo".

If a user puts in a string that looks a lot like a valid e-mail, but it actually is not one, then that is a problem that should be handled by a different part of the application.

0
votes

Can anyone provide some insight as to why that is?

Yes, it is an extremely complicated standard that allows lots of stuff that no one really uses today. :)

Are there any known and proven regexps that actually do this fully?

Here is one attempt to parse the whole standard fully...

http://ex-parrot.com/~pdw/Mail-RFC822-Address.html

What are some good alternatives to using regexps for matching email addresses?

Using an existing framework for it in whatever language you are using I guess? Though those will probably use regexp internally. It is a complex string. Regexps are designed to parse complex strings, so that really is your best choice.

Edit: I should add that the regexp I linked to was just for fun. I do not endorse using a complex regexp like that - some people say that "if your regexp is more than one line, it is guaranteed to have a bug in it somewhere". I linked to it to illustrate how complex the standard is.

0
votes

For completeness of this post, also for PHP there is a language built-in function to validate e-mails.

For PHP Use the nice filter_var with the specific EMAIL validation type :)

No more insane email regexes in php :D

var_dump(filter_var('user@example.com', FILTER_VALIDATE_EMAIL));

http://www.php.net/filter_var

0
votes

There always seems to be an unaccounted-for format when trying to create a regular expression to validate emails. Though there are some characters that are not valid in an email address, the basic format is local-part@domain, with roughly 64 chars max for the local part and roughly 253 chars for the domain. Besides that, it's kind of like the wild wild west.
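
Those length limits are easy to check separately from any regex. Here is a minimal Python sketch using the figures quoted above; the constant and function names are hypothetical, and it counts characters rather than encoded bytes.

```python
MAX_LOCAL = 64    # rough limit for the part before the '@'
MAX_DOMAIN = 253  # rough limit for the part after the '@'


def within_length_limits(address: str) -> bool:
    """Check the local-part and domain lengths of local-part@domain."""
    # Split at the last '@', since a quoted local part may itself contain one.
    local, sep, domain = address.rpartition("@")
    if not sep:
        return False  # no '@' at all
    return 0 < len(local) <= MAX_LOCAL and 0 < len(domain) <= MAX_DOMAIN
```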

I think the answer depends on your definition of a validated email address and what your business process has tolerance for. Regular expressions are great for making sure an email is formatted properly and as you know there are many variations of them that can work. Here are a couple of variations:

Variant 1:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Variant 2:

\A(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])\z

Just because an email is syntactically correct doesn't mean it is valid.

An email address can adhere to RFC 5322 and pass the regex, but that gives no true insight into the address's actual deliverability. What if you wanted to know whether the address was bogus, disposable, undeliverable, or a known bot? What if you wanted to exclude addresses that were vulgar or in some way factious or problematic?

Full disclosure: I work for Service Objects, a data validation company. Being a professional in the email validation field, I feel the solution we offer provides better validation than a regex. Feel free to give it a look; I think it can help a lot. You can see more info about this in our dev guide. It actually does a lot of cool email checks and verifications.

Here's an example:

Email: mickeyMouse@gmail.com

{
  "ValidateEmailInfo":{
      "Score":4,
      "IsDeliverable":"false",
      "EmailAddressIn":"mickeyMouse@gmail.com",
      "EmailAddressOut":"mickeyMouse@gmail.com",
      "EmailCorrected":false,
      "Box":"mickeyMouse",
      "Domain":"gmail.com",
      "TopLevelDomain":".com",
      "TopLevelDomainDescription":"commercial",
      "IsSMTPServerGood":"true",
      "IsCatchAllDomain":"false",
      "IsSMTPMailBoxGood":"false",
      "WarningCodes":"22",
      "WarningDescriptions":"Email is Bad - Subsequent checks halted.",
      "NotesCodes":"16",
      "NotesDescriptions":"TLS"
  }
}