Issues

ZF-42: Zend_Validate::isEmail() implementation (TRAC#34)

Description

in accordance with sections 3.2.4 , 3.4, 3.4.1 of the RFC .



 public static function isEmail($value)
    {
        /**
         * @todo RFC 2822 (http://www.ietf.org/rfc/rfc2822.txt)
         */
        $atom = '[-a-z0-9!#$%&\'*+/=?^_`{|}~]';
        $domain = '([a-z0-9]([-a-z0-9]*[a-z0-9]+)?)';
        $regex = '^' . $atom . '+' . '(\.' . $atom . '+)*\@(' . $domain . '{1,63}\.)+' . $domain . '{2,63}$';

        if (eregi($regex, $value))
        {
            return $value;
        }

        return false;
    }

Comments

***** patch on trac

05/07/06 14:28:55: Modified by mike

* owner changed from zend to jim2012@gmail.com.

POSIX regexp support (ereg) is moving to PECL per the Paris PDM (see http://php.net/~derick/meeting-notes.html/…). Please refactor to use PCRE and send unit tests. 05/08/06 14:49:31: Modified by jim2012@gmail.com

Here's a new version of this function. Using preg_match() now. Adherence to RFC 2822 has been lowered to provide "good enough" matching for emails that an end-user might actually try to send emails to and not emails that are valid for display in Outlook Express eeeek. This validates emails like 'john.doe@example.museum.uk' and not "John O'Donnell (my friend) john!O'Donnell!@ex.museum.uk" which the RFC says are valid.


 public static function isEmail($value)
    {
        /**
         * @todo RFC 2822 (http://www.ietf.org/rfc/rfc2822.txt)
         */
        $atom = '[-a-z0-9_]';
        $domain = '([a-z0-9]([-a-z0-9]*[a-z0-9]+)?)';
        $regex = '/^' . $atom . '+' . '(\.' . $atom . '+)*\@(' . $domain . '{1,63}\.)+' . $domain . '{2,63}$/i';

        if (preg_match($regex, $value))
        {
            return $value;
        }

        return false;
    }

i'll attach a patch for FilterTest?.php 05/08/06 14:50:04: Modified by jim2012@gmail.com

* attachment filtertest.patch added.

05/16/06 11:05:58: Modified by zend@caomhin.demon.co.uk

The patched regex doesn't catch email addresses which use + which a lot of people use so something like bob+sitename@example.com gets delivered to bob@example.com when they give their email address out.

The following amended $atom variable catches these addresses too

$atom = '[-a-z0-9_+]'; 05/27/06 15:11:17: Modified by karim79@gmail.com

For those of you who might be interested, this function first checks if the input string is a correctly formatted email. Passing that, the code checks the existence of the domain using checkdnsrr() then if an MX exists at that domain. /**

*
      o
            + Returns value if it is a valid email format, FALSE otherwise. 

        *

*
      o
            + @param mixed $value
            + @return mixed 

        */

    public static function isEmail($value) {

if (preg_match('/[a-zA-Z0-9\._-]+\@(\[?)[a-zA-Z0-9\-\.]+'.

            '\.([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$/', $value)) {

    if ($domainCheck && function_exists('checkdnsrr')) { list (, $domain) = explode('@', $value);

if (checkdnsrr($domain, 'MX') checkdnsrr($domain, 'A')) {

        return true;

    } return false; } return true;

} return false;

    }

"Zend_Filter provides a library of static methods for filtering data."
Lets not make it do any more than that. Email validation should be done elsewhere.

public static function isEmail($value,$domcheck = true) { if ( (preg_match('/(@.*@)|(..)|(@.)|(.@)|(^.)/', $value)) || (preg_match('/^.+\@([?)[a-zA-Z0-9-.]+.([a-zA-Z]{2,3}|[0-9]{1,3})(]?)$/',$value)) ) { if ($domcheck === true) { $dom = substr($value, strpos($value, '@') +1); if ( checkdnsrr($dom.'.', 'MX' ) ) return TRUE; if ( checkdnsrr($dom.'.', 'A' ) ) return TRUE; if ( checkdnsrr($dom.'.', 'CNAME') ) return TRUE; } else { return TRUE; }
} return FALSE; }

Original comment by Markus Wolff:

While Zend_Filter_Input::testEmail() is already advertised in the manual, said method relies on Zend_Filter::isEmail(), which currently contains nothing more than a TODO marker. I don't know what kind of mega-super-duper email validation stuff was originally planned for this, but considering the method is needed by methods that are documented as if they were working, wouldn't it be nice to at least put a simple RegEx in place here, so that the method at least works in 95% of all cases?

If nothing else, it saves first-time users from having to backtrack the source of the problem throughout the ZF source.

Here's a regex that works for me: /^a-zA-Z?@([[:alnum:]-_].)+[a-zA-Z]{2,4}$/

I like a default validation without a domain check for speed, and with a generic enough regex to allow all valid tlds (many regex fail to match the .museum tld for example).

The problem with the domain check added by Andre (see 07/Jul/06) is that 'checkdnsrr' is not implemented for Windows servers. I'm developing on Windows running Apache, but plan to release on Linux.... All such implementations of this type of framework, especially here, need to be platform independent. A complete domain check would also connect to the domain and start writing a pseudo email to actually validate the address itself as existing and valid. This gets expensive on large operations, however it can be helpful for true validity tests.

The regex trade-off of course lies in how many false positives can be allowed vs. false negatives... I generally lean toward limiting false negatives because the users suffer. The further email check can reduce error rates even more at the expense of network resources.

At a minimum, this should be released with a very basic regex check with a todo for upgrading (hopefully adding domain checks), not leaving it blank. In addition, leaving it clean enough to easily extend with such domain checks and pseudo-email checks I think is essential.

I think the driving need for something like isEmail() requires more than a simple function accepting a string and returning a boolean. For most applications I've worked on over the years, we needed something closer to:

http://search.cpan.org/~rjbs/Email-Valid-0.175/…

I think comparing the implementations of known "robust" solutions will highlight some surprising differences. I would also like to hear the 3 authors critique each other's implementations, if they are so inclined.

see file named RFC822.php (the author of the related PEAR class) http://www.phpguru.org/static/mime.mail.html

Used by Horde CVS v3.2 (not in stable yet) http://iamcal.com/publish/articles/…

Note the "strict" flag that allows bypassing extensive checks in favor of something more basic http://java.sun.com/j2ee/1.4/…

Mislav Marohnic wrote:

isEmail development should start with an extensive test case - only after that real function has a chance to start rolling.

I think we should review the approaches above, and approach the authors, asking if they would like to contribute their code to the ZF, instead of us trying to re-invent such a common, but difficult wheel.

From Pascal's email (linked below), we are reminded that not everyone wants or needs an exacting RFC822-compliant validation. Sometime a weird character in an email address is much more likely to be an accident, even if the address validates to RFC822. In fact, given the complexity of code to perform strict validation of RFC822, I like David's suggestion (see his comments for [ZF-42] http://framework.zend.com/issues/browse/ZF-42 ). Having a validator optimized for speed that doesn't return false positives could be useful for batch processing as the first pass over a set of email addresses with a high percentage of bad addresses. However, that is probably overkill for the ZF.

http://nabble.com/isEmail-filter.-tf1907194.html/…

[1] RFC 2822, 3.4.1. Addr-spec specification: http://www.ietf.org/rfc/rfc2822.txt

[2] RFC 1035, 2.3.1. Preferred name syntax: http://www.ietf.org/rfc/rfc1035.txt

[3] List of Top Level Domains: http://data.iana.org/TLD/tlds-alpha-by-domain.txt

I think it is very confusing to have the isEmail() method in the specification but in practise it won't work. It took me about half an hour to understand what went wrong. I think validation of user input (and output) is very important, and this issue should be solved soon.

I found the following implentation/explaination of all the things the RFC describes.

http://ilovejackdaniels.com/php/…

I hope this would be usefull.

-- Robert Ros

Ok for what's about Robert give as link.

But it miss completely IDN domain like schlümberger.de or hägi.ch which could be regexp valid as this 2 tld authorize them :-)))

And I don't speak about all other language where IDN are in use.

If extended characters are in use, what about the utf-8 and + trouble ? Will primitive used utf aware ?

This updates one of the earlier patches submitted.

Index: library/Zend/Filter.php

--- library/Zend/Filter.php (revision 849) +++ library/Zend/Filter.php (working copy) @@ -239,6 +239,16 @@ /** * @todo RFC 2822 (http://www.ietf.org/rfc/rfc2822.txt) / + $atom = '[-a-z0-9_+!#\$\%\&\'*\/\=\?\^`{|}\~]'; + $domain = '(a-z0-9?)'; + $regex = '/^' . $atom . '+' . '(.' . $atom . '+)\@(' . $domain . '{1,63}.)+' . $domain . '{2,63}$/i'; + + if (preg_match($regex, $value)) + { + return $value; + } + + return false; }

 /**

Note the chaned $atom line. http://tools.ietf.org/html/rfc2822 allows many more characters in the local part than were allowed in the first patch.

Changed to "bug", since documentation and API implies isEmail() and testEmail() work.

Changing fix version to 0.9.0.

If it won't be implemented partially, the documentation should be at least be changed.

If we're waiting for PHP 5.2 to become the minimum requirement (re: the push to 0.9.0), it won't help, since filter_var() is only RFC 822-compliant (or is it? I don't know, the regular expression it uses seems a bit on the simple side). On that note, all of Gavin's implementations posted above are RFC 822-only also.

I agree with Gavin's idea of contacting the people behind other robust e-mail validation routines and asking them to update their code for RFC 2822 and contribute it to the framework.

It seems to me at first blush that any real validation is outside the scope of the Zend_Filter class. My first inclination would be to tokenize the string and go through it bit by bit, and many of the implementations I'm glancing at look like they do just that.

I did a bit of preliminary work over Xmas on this, and came up with the following. The method splits up the email address and uses Zend_Filter::isHostname to match the domain (which I think may need checking against the RFC).

the local part is then checked against first the dot-atom characters, which should form the majority of valid local email addresses. There is also the quoted string format and obsolete formats, however, these should not be used for modern email addresses so it seems overkill to write a single regex to check for all possible formats.

So my approach is to do a check against dot-atom, and if that fails try quoted string, then try the obsolete format. I haven't got around to the latter two formats yet. I believe that should be a faster match 99% of the time.

public static function isEmail($value) { // Split email address up if (preg_match('/^([^@]+)@([^@]+)$/', $value, $matches)) { $localPart = $matches[1]; $domain = $matches[2];

    // Match domain part (allow hostnames and IP addresses)
    $domainResult = self::isHostname($domain, 3);

    // First try to match the local part on dot-atom characters
    // since this is the most common format
    $localResult = false;

    // Dot-atom characters
    // ALPHA / DIGIT / and "!", "#", "$", "%", "&", "'", "*", "+",
    // "-", "/", "=", "?", "^", "_", "`", "{", "|", "}", "~", "."
    // Dot character "." must be surrounded by other non-dot characters
    $dotAtom = '[a-zA-Z0-9\x21\x23\x24\x25\x26\x27\x2a\x2b\x2d\x2f\x3d';
    $dotAtom .= '\x3f\x5e\x5f\x60\x7b\x7c\x7d\x7e\x2e]';

    // TODO: speed test strpos instead of preg_match for dot in start and end of string
    if ( (preg_match('/^' . $dotAtom . '+$/', $localPart)) &&
    (!preg_match('/^\x2e/', $localPart)) && (!preg_match('/\x2e$/',

$localPart)) ) { $localResult = true; }

    // If not matched, try quoted string format
    if (!$localResult) {
        // TODO
    }

    // if not matched, try obsolete format
    if (!$localResult) {
        // TODO
    }

    if ($localResult && $domainResult) {
        return true;
    }
}
return $false;

}

There was also some mention of this method checking the network to see if the email address actually is valid. I don't think this method is the best place for that for time delay issues. Personally, I'd prefer to check real email addresses via the network asynchronously via a separate cron script.


In reply, Bruno Friedman wrote:

Available real tld are: http://data.iana.org/TLD/tlds-alpha-by-domain.txt

And in my case I check the hostname with checkdnsrr to see if there one record or more. (I've a special case for windows os with a system call to dig and/or nslookup). But as always this impose network connection, but for the dns is took only 2 to 4 seconds fi there's not a plenty of resolver inside the web server.

Checking about the domain part (without the tld) wouldn't be easy, because differents rules applies to each tld authority. For example in Switzerland your domain should be 26 max length, but now they accept IDN domain (think to müller ) and the domain max length is 64.

I'm not really agree with your asynchrone checking. Because how to says to a anonymous user behind is browser that's the email he entered is some kind of wrong.

Before the massive usage of greylisting, i've used a network method to directly speak with the mailers founds in DNS. To see if I've a mail to this user would be accepted. And it works very nice ! And 100% email are true. But I've no time to adapt it to greylisting error message.


In reply I wrote:

Essentially I believe this to be an application design decision. Zend_Filter::isEmail should, in my opinion, be just a character format check. If required then a network checker for emails could exist as a separate function. Then developers can use both at once, or one after the other.

A suggested method to check via the network would be:

dig MX yourdomain.com

  • returns mail.yourdomain.com as the answer to this DNS query

telnet mail.yourdomain.com 25

HELO mydomain.com

MAIL FROM:name@mydomain.com

RCPT TO:name@yourdomain.com

This would usually return a status of 250 if the email address is valid.


In reply Bruno Friedman wrote:

Yes the modular approach is the best one ... Syntax check, Network check.

For what you explain at the end is the most difficult part. Think about greylisting ( receiver mailer send a error first and a delay for coming back usually 180 sec ) Or what about mailer that's just couldn't be reache at the time we check ... It's not a trouble for a real mailer it will retry some time later, but our script it is ... :-))

Without going into great detail, attempting to verify that an email address actually works without having a business relationship to the owner of the domain is a bad idea. Eons ago, spammers realized this actually worked sometimes. Soon, administrators that knew what they were doing closed this loophole. Yahoo, AOL, and many others now have an API and system for business partners to validate email addresses effectively and efficiently. Everyone else probing for deliverable email addresses are normally blocked by these admins, and possibly black-listed or tar-pitted.

However, verifying that a DNS record (either MX or A) exists for the domain is helpful. I've used dns_get_record() with good results, although various complications might arise that introduce unpredictable results -e.g. not all libresolv's are created equal, nor are all systems configured reliably, such as multiple DNS servers, reasonable timeouts, etc. Even with a nicely tuned and configured bind install, with caching enabled, I find my web pages still "hiccup" sometimes with unpleasant delays noticeable to the end user, if I make the page wait for completion of dns_get_record() - obviously simply a byproduct of the nature of DNS lookups.

I still believe the API for http://search.cpan.org/~rjbs/Email-Valid-0.175/… is vastly superior for most web applications to the overly simplistic isEmail() returning a boolean.

I've attached a proposed stop-gap for 0.7 if this is considered useful for the community. I'll continue to look into the full character matching the domain MX matching. Am also talking to Darby on the new Zend_Filter proposal at http://framework.zend.com/wiki/display/… which seems to affect this

I think it would be useful to have this function implemented for 0.7. I feel that this is an obvious gaping hole in the framework that can be easily fixed - even if we do have to replace it with the new Zend_Filter in an upcoming release.

Gavin, or anyone else at Zend, would it be acceptable to push this into Zend_Filter for 0.7 with the understand it will continue to be developed under the new Zend_Filter system?

If so, how do I go about this? Do you want an updated full Zend/Filter.php file uploaded to this page - I presume I cannot commit anything to the SVN repository..

Have added basic functionality via the new Zend_Filter_EmailAddress in the lab - http://framework.zend.com/fisheye/browse/…

And will continue to develop this functionality there

Development has moved to the incubator. Currently basic unit tests have 100% code coverage, and some @todos remain in the code. Thank you everyone for your contributions as we continue to strive for improvements! :)

Revision 3429, which will be released with 0.8, validates the common email formats dot-atom and quoted-string. There may be some obsolete local-part formats that the validator doesn't currently pass though I believe these will rarely be required and it's more useful to release Zend_Validate_EmailAddress now which covers 90% of the RFC and probably 99.9% of all real-world usage.

I will continue to look at the character matching for 0.9 so if anyone has any direct comments on things I've missed please let me know

Also, I'll look at the MX lookup for the hostname for 0.9 - the character matching for the hostname is entirely within Zend_Validate_Hostname

Usage documentation has also been added to the end-user manual.

changed component to Zend_Validate (was Zend_Filter)

MX checking is now supported for EmailAddress validation (see attached file). This is done by passing a second parameter in the constructor, i.e.



The following method can be used publically (if you so wish) to see if MX checking is available:

At present this functionality depends on dns_get_mx() which is UNIX only.

Is this acceptable since this won't work with Windows? Looking into it PEAR has functions that do the same thing and do support Windows but these are far more complex and would probably require a seperate class to organise.

Fixed in version 3927

Documentation to be updated shortly

If you try to use MX checking on a Windows system this will throw an exception because at present this feature is only supported on UNIX machines via dns_get_mx

it would be worth looking into Windows support via another method, but I suspect this will be rather complex and may make sense to be encapsulated within another network type library class