Issues

ZF-1883: Add polish top-level domain allowed utf-8 characters in Zend_Validate_Hostname

Description

I've created file with allowed utf-8 characters for polish top-level domains for use in Zend_Validate_Hostname. This should probably be put into Zend/Validate/Hostname/Pl.php file (new file) so it would be similar like with languages already supported.

This file code is:


<?php
class Zend_Validate_Hostname_Pl implements Zend_Validate_Hostname_Interface
{
    /**
     * Returns UTF-8 characters allowed in DNS hostnames for the specified Top-Level-Domain
     *
     * @see http://www.dns.pl/IDN/rejestracja_domen_idn.txt Polish (.PL)
     * @return string
     */
    static function getCharacters()
    {
        return  '\x{00B7}\x{00E0}-\x{00F6}\x{00F8}-\x{00FF}\x{0101}\x{0103}\x{0105}\x{0107}\x{010B}' .
                '\x{010D}\x{010F}\x{0111}\x{0113}\x{0115}\x{0117}\x{0119}\x{011B}\x{011D}\x{011F}' .
                '\x{0121}\x{0123}\x{0125}\x{0127}\x{0129}\x{012B}\x{012D}\x{012F}\x{0131}\x{0135}' .
                '\x{0137}\x{0138}\x{013A}\x{013C}\x{013E}\x{0142}\x{0144}\x{0146}\x{0148}\x{014B}' .
                '\x{014D}\x{014F}\x{0151}\x{0153}\x{0155}\x{0157}\x{0159}\x{015B}\x{015D}\x{015F}' .
                '\x{0161}\x{0163}\x{0165}\x{0167}\x{0169}\x{016B}\x{016D}\x{016F}\x{0171}\x{0173}' .
                '\x{0175}\x{0177}\x{017A}\x{017C}\x{017E}\x{0390}\x{03AC}-\x{03CE}\x{0430}-\x{045F}' .
                '\x{0491}\x{0492}\x{05D0}-\x{05EA}';
    }
}
?>

Unfortunately document that specifies that special characters is available only in polish (http://www.dns.pl/IDN/rejestracja_domen_idn.txt).

Comments

Assigned to Darby

Darby - I can implement this as long as you're happy with the regex. Obviously I can't check the Polish site :-)

Yes, Simon, please do. Thank you!

This is not quite as simple as the supplied code since the .Pl domains accept 4 different character sets (Hebrew, Cyrillic, etc) and you can't mix characters between different character sets.

So I'll have to look into this a litle deeper, and will try to look at some other IDNs (i.e. .com) too.

Maybe the simpliest way to validate different character sets allowed for domain when you can't mix characters berween different sets is the following.

Allow the getCharacters method to return also arrays of different set characters as separate items. This needs to be done only in the interface comment (so that developers know that arrays can also be returned) and in the isValid method in HostnameValidator class. Now the hostname validation if array was returned should go like this: 1) check domain name without any special characters, if it is valid then validation is over with success 2) if it is not valid then check it again separately for every character set in array returned by the getCharacters method 3) if the domain name is valid only one of the checks from point 2 should return true, if none of the checks return true or if more then one return true then given value is not valid

Changes that needs to be done:

Change Zend/Validate/Hostname/Interface.php so that function getCharacters could return also arrays (just a comment fix)


interface Zend_Validate_Hostname_Interface
{

    /**
     * Returns UTF-8 characters allowed in DNS hostnames for the specified Top-Level-Domain
     *
     * UTF-8 characters should be written as four character hex codes \x{XXXX}
     * For example Ă© (lowercase e with acute) is represented by the hex code \x{00E9}
     *
     * You only need to include lower-case equivalents of characters since the hostname
     * check is case-insensitive
     *
     * Please document the supported TLDs in the documentation file at:
     * manual/en/module_specs/Zend_Validate-Hostname.xml
     *
     * @see http://en.wikipedia.org/wiki/…
     * @see http://www.iana.org/cctld/ Country-Code Top-Level Domains (TLDs)
     * @see http://www.columbia.edu/kermit/utf8-t1.html UTF-8 characters
     * @return string|array
     */
    static function getCharacters();
}

Then modify the incoming Zend/Validate/Hostname/Pl.php into


class Zend_Validate_Hostname_Pl implements Zend_Validate_Hostname_Interface
{
    /**
     * Returns UTF-8 characters allowed in DNS hostnames for the specified Top-Level-Domain
     *
     * @see http://www.dns.pl/IDN/rejestracja_domen_idn.txt Polish (.PL)
     * @return string
     */
    static function getCharacters()
    {
        $latin = '\x{00B7}\x{00E0}-\x{00F6}\x{00F8}-\x{00FF}\x{0101}\x{0103}\x{0105}\x{0107}\x{010B}' .
                '\x{010D}\x{010F}\x{0111}\x{0113}\x{0115}\x{0117}\x{0119}\x{011B}\x{011D}\x{011F}' .
                '\x{0121}\x{0123}\x{0125}\x{0127}\x{0129}\x{012B}\x{012D}\x{012F}\x{0131}\x{0135}' .
                '\x{0137}\x{0138}\x{013A}\x{013C}\x{013E}\x{0142}\x{0144}\x{0146}\x{0148}\x{014B}' .
                '\x{014D}\x{014F}\x{0151}\x{0153}\x{0155}\x{0157}\x{0159}\x{015B}\x{015D}\x{015F}' .
                '\x{0161}\x{0163}\x{0165}\x{0167}\x{0169}\x{016B}\x{016D}\x{016F}\x{0171}\x{0173}' .
                '\x{0175}\x{0177}\x{017A}\x{017C}\x{017E}';

        $greek = '\x{0390}\x{03AC}-\x{03CE}';

        $cyrillic = '\x{0430}-\x{045F}\x{0491}\x{0492}';

        $hebrew = '\x{05D0}-\x{05EA}'; 
        
        return array($latin, $greek, $cyrillic, $hebrew);
    }
}

Now the last change is in Zend/Validate/Hostname.php isValid() method. Change following lines (319 - 369):


/**
 * Match against IDN hostnames
 * @see Zend_Validate_Hostname_Interface
 */
$labelChars = 'a-z0-9';
$utf8 = false;
$classFile = 'Zend/Validate/Hostname/' . ucfirst($this->_tld) . '.php';
if ($this->_validateIdn) {
    if (Zend_Loader::isReadable($classFile)) {

        // Load additional characters
        $className = 'Zend_Validate_Hostname_' . ucfirst($this->_tld);
        Zend_Loader::loadClass($className);
        $labelChars .= call_user_func(array($className, 'getCharacters'));
        $utf8 = true;
    }
}

// Keep label regex short to avoid issues with long patterns when matching IDN hostnames
$regexLabel = '/^[' . $labelChars . '\x2d]{1,63}$/i';
if ($utf8) {
    $regexLabel .= 'u';
}

// Check each hostname part
$valid = true;
foreach ($domainParts as $domainPart) {

    // Check dash (-) does not start, end or appear in 3rd and 4th positions
    if (strpos($domainPart, '-') === 0 ||
    (strlen($domainPart) > 2 && strpos($domainPart, '-', 2) == 2 && strpos($domainPart, '-', 3) == 3) ||
    strrpos($domainPart, '-') === strlen($domainPart) - 1) {

        $this->_error(self::INVALID_DASH);
        $status = false;
        break 2;
    }

    // Check each domain part
    $status = @preg_match($regexLabel, $domainPart);
    if ($status === false) {
        /**
         * Regex error
         * @see Zend_Validate_Exception
         */
        require_once 'Zend/Validate/Exception.php';
        throw new Zend_Validate_Exception('Internal error: DNS validation failed');
    } elseif ($status === 0) {
        $valid = false;
    }
}

into something like:


/**
 * Match against IDN hostnames
 * @see Zend_Validate_Hostname_Interface
 */
$labelChars = 'a-z0-9';
$utf8 = false;
$classFile = 'Zend/Validate/Hostname/' . ucfirst($this->_tld) . '.php';
if ($this->_validateIdn) {
    if (Zend_Loader::isReadable($classFile)) {

        // Load additional characters
        $className = 'Zend_Validate_Hostname_' . ucfirst($this->_tld);
        Zend_Loader::loadClass($className);
        $specialChars = call_user_func(array($className, 'getCharacters'));
        if (!is_array($specialChars)) {
            $labelChars .= $specialChars;
        }
        $utf8 = true;
    }
}

// Keep label regex short to avoid issues with long patterns when matching IDN hostnames
$regexLabel = '/^[' . $labelChars . '\x2d]{1,63}$/i';
if ($utf8) {
    $regexLabel .= 'u';
}

// Check each hostname part
$valid = true;
foreach ($domainParts as $domainPart) {

    // Check dash (-) does not start, end or appear in 3rd and 4th positions
    if (strpos($domainPart, '-') === 0 ||
    (strlen($domainPart) > 2 && strpos($domainPart, '-', 2) == 2 && strpos($domainPart, '-', 3) == 3) ||
    strrpos($domainPart, '-') === strlen($domainPart) - 1) {

        $this->_error(self::INVALID_DASH);
        $status = false;
        break 2;
    }

    // Check each domain part
    $status = @preg_match($regexLabel, $domainPart);
    
    // If status is false it means that there are special characters in domain part that needs to be validated
    if ($status === false && isset($specialChars) && is_array($specialChars)) {
        
        $statusArray = array(); // holds validation result for every character set
        for ($i = 0; $i < count($specialChars); $i++) {
            $regexLabel = '/^[' . $specialChars[$i] . '\x2d]{1,63}$/i'; // build new regex for set
            if ($utf8) {
                $regexLabel .= 'u';
            } 
            $statusArray[$i] = @preg_match($regexLabelTemp, $domainPart);
        }
        
        // If domain name is correct there should be only one 'TRUE' value in $statusArray
        $trueCount = 0;
        for ($i = 0; $i < count($statusArray); $i++) {
            if ($statusArray[$i]) {
                $trueCount++;
            }
        }
        
        $status = ($trueCount == 1); 
    }
    
    if ($status === false) {
        /**
         * Regex error
         * @see Zend_Validate_Exception
         */
        require_once 'Zend/Validate/Exception.php';
        throw new Zend_Validate_Exception('Internal error: DNS validation failed');
    } elseif ($status === 0) {
        $valid = false;
    }
}

This should work if I haven't done any typos.

Thanks for the suggestions Lukasz.

I'd suggest the following small changes:

1) If IDN is enabled and if there are any special characters in the domain (i.e. anything other than a-z, dash "-", and 0-9) do the following steps: 2) Load the first allowed character set and try to validate the domain 3) If fail, go to the next allowed character set and try again 4) Repeat until get to the last allowed character set

I agree returning arrays from getCharacters is a good idea to support this. A mixed return type isn't ideal, but I don't think it's against the ZF Coding Standards.

Also, the whole domain (including any sub-domain parts) has to match one character set rather than allow any character set per domain part. I think your code would allow two different characters sets for two different domain parts.

I probably also need to look into whether sub-domains (i.e. www, dev, etc) can include international characters at all since if not this process is simplified and we can just check for the main "domain name" part (for example "domain" in "www.domain.com") against these characters sets.

However, another implementation I saw on the web (http://tldchk.berlios.de/) groups some character sets into external files which are then included. This suggests it's possible to share character sets across some IDN domains. I'll check into this first since if true this may make it easier to maintain and cope with domains that accept large character sets.

This doesn't appear to have been fixed in 1.5.0. Please update if this is not correct.

No this has not been fixed. Apologies, i had a baby (at least my wife did!) before Christmas and it's taken my life over somewhat. I will take a look at this issue and will post back a more helpful response within a week

New feature implemented with the last rework of Zend_Validate_HostName.