ZF-1883: Add Polish top-level domain allowed UTF-8 characters in Zend_Validate_Hostname
Description
I've created a file with the UTF-8 characters allowed in Polish top-level domains, for use in Zend_Validate_Hostname. It should probably go into a new file, Zend/Validate/Hostname/Pl.php, so that it matches the languages already supported.
The file's code is:
<?php
class Zend_Validate_Hostname_Pl implements Zend_Validate_Hostname_Interface
{
    /**
     * Returns UTF-8 characters allowed in DNS hostnames for the specified Top-Level-Domain
     *
     * @see http://www.dns.pl/IDN/rejestracja_domen_idn.txt Polish (.PL)
     * @return string
     */
    static function getCharacters()
    {
        return '\x{00B7}\x{00E0}-\x{00F6}\x{00F8}-\x{00FF}\x{0101}\x{0103}\x{0105}\x{0107}\x{010B}' .
               '\x{010D}\x{010F}\x{0111}\x{0113}\x{0115}\x{0117}\x{0119}\x{011B}\x{011D}\x{011F}' .
               '\x{0121}\x{0123}\x{0125}\x{0127}\x{0129}\x{012B}\x{012D}\x{012F}\x{0131}\x{0135}' .
               '\x{0137}\x{0138}\x{013A}\x{013C}\x{013E}\x{0142}\x{0144}\x{0146}\x{0148}\x{014B}' .
               '\x{014D}\x{014F}\x{0151}\x{0153}\x{0155}\x{0157}\x{0159}\x{015B}\x{015D}\x{015F}' .
               '\x{0161}\x{0163}\x{0165}\x{0167}\x{0169}\x{016B}\x{016D}\x{016F}\x{0171}\x{0173}' .
               '\x{0175}\x{0177}\x{017A}\x{017C}\x{017E}\x{0390}\x{03AC}-\x{03CE}\x{0430}-\x{045F}' .
               '\x{0491}\x{0492}\x{05D0}-\x{05EA}';
    }
}
?>
Unfortunately, the document that specifies these special characters is available only in Polish (http://www.dns.pl/IDN/rejestracja_domen_idn.txt).
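For readers unfamiliar with how such a character string is consumed, here is a minimal sketch of one way it could be dropped into a PCRE character class. The helper labelMatches() and the shortened $polish string are hypothetical and only for illustration; this is not the actual Zend_Validate_Hostname code, and it assumes the source file is saved as UTF-8. The 1 to 63 length check is a simplification applied to the Unicode label.

```php
<?php
// Hypothetical helper (not part of Zend Framework): checks a single hostname
// label against ASCII letters, digits and hyphen, plus an extra fragment of
// PCRE character-class syntax such as the string returned by getCharacters().
function labelMatches($label, $extraChars)
{
    // The "u" modifier makes PCRE treat pattern and subject as UTF-8, which
    // is needed for \x{....} escapes above 0xFF; "i" ignores case.
    $pattern = '/^[a-z0-9\-' . $extraChars . ']{1,63}$/iu';
    return preg_match($pattern, $label) === 1;
}

// Illustrative excerpt only: a few Polish letters taken from the ranges above.
$polish = '\x{00F3}\x{0105}\x{0107}\x{0119}\x{0142}\x{0144}\x{015B}\x{017A}\x{017C}';

var_dump(labelMatches('żółw', $polish));   // bool(true)
var_dump(labelMatches('zolw', $polish));   // bool(true)
var_dump(labelMatches('żółw!', $polish));  // bool(false)
```

The single-quoted PHP strings matter here: they pass the \x{....} sequences through to PCRE untouched instead of letting PHP interpret them.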
Comments
Posted by Thomas Weidner (thomas) on 2007-08-27T14:47:13.000+0000
Assigned to Darby
Posted by Simon R Jones (studio24) on 2007-09-17T09:18:58.000+0000
Darby - I can implement this as long as you're happy with the regex. Obviously I can't check the Polish site :-)
Posted by Darby Felton (darby) on 2007-09-17T09:27:05.000+0000
Yes, Simon, please do. Thank you!
Posted by Simon R Jones (studio24) on 2007-09-20T11:32:57.000+0000
This is not quite as simple as the supplied code, since .pl domains accept 4 different character sets (Hebrew, Cyrillic, etc.) and characters from different sets cannot be mixed.
So I'll have to look into this a little deeper, and will try to look at some other IDNs (e.g. .com) too.
Posted by Lukasz Rajchel (rayell) on 2007-09-25T09:42:37.000+0000
Maybe the simplest way to validate the different character sets allowed for a domain, when characters from different sets can't be mixed, is the following.
Allow the getCharacters method to also return an array, with each character set as a separate item. This only requires a change to the interface comment (so that developers know arrays can be returned) and to the isValid method in the Zend_Validate_Hostname class. When an array is returned, hostname validation should go like this:
1) Check the domain name without any special characters; if it is valid, validation ends with success.
2) If it is not valid, check it again separately against every character set in the array returned by the getCharacters method.
3) The domain name is valid only if exactly one of the checks from point 2 returns true; if none of them returns true, or more than one does, the given value is not valid.
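The three steps above can be sketched roughly as follows. This is only an illustration under assumptions: the function name isValidLabelExclusive() and the two tiny character sets are hypothetical, and the real isValid() method in Zend/Validate/Hostname.php is structured differently.

```php
<?php
// Hypothetical sketch of the "exactly one character set" rule.
function isValidLabelExclusive($label, array $charSets)
{
    // Step 1: a plain ASCII label is accepted without further checks.
    if (preg_match('/^[a-z0-9\-]{1,63}$/i', $label) === 1) {
        return true;
    }

    // Step 2: test the label against each character set separately.
    $matches = 0;
    foreach ($charSets as $chars) {
        if (preg_match('/^[a-z0-9\-' . $chars . ']{1,63}$/iu', $label) === 1) {
            $matches++;
        }
    }

    // Step 3: the label is valid only if exactly one set accepted it.
    return $matches === 1;
}

// Illustrative subsets only, not the full lists from the .pl registry.
$sets = array(
    'latin'    => '\x{00F3}\x{0105}\x{0142}\x{017C}',
    'cyrillic' => '\x{0430}-\x{045F}',
);

var_dump(isValidLabelExclusive('żółw', $sets));   // bool(true)
var_dump(isValidLabelExclusive('домен', $sets));  // bool(true)
var_dump(isValidLabelExclusive('żółд', $sets));   // bool(false)
```

One consequence of the exactly-one rule is that a label would be rejected if two sets happened to share all of its special characters, so the sets would need to be disjoint for this to behave as intended.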
Changes that need to be made:
Change Zend/Validate/Hostname/Interface.php so that getCharacters can also return arrays (just a comment fix)
Then modify the incoming Zend/Validate/Hostname/Pl.php into
Now the last change is in the isValid() method in Zend/Validate/Hostname.php. Change the following lines (319 - 369):
into something like:
This should work, provided I haven't made any typos.
Posted by Simon R Jones (studio24) on 2007-09-26T13:07:59.000+0000
Thanks for the suggestions Lukasz.
I'd suggest the following small changes:
1) If IDN is enabled and there are any special characters in the domain (i.e. anything other than a-z, 0-9 and the dash "-"), do the following steps:
2) Load the first allowed character set and try to validate the domain.
3) If that fails, go to the next allowed character set and try again.
4) Repeat until the last allowed character set has been tried.
I agree returning arrays from getCharacters is a good idea to support this. A mixed return type isn't ideal, but I don't think it's against the ZF Coding Standards.
Also, the whole domain (including any sub-domain parts) has to match one character set, rather than allowing any character set per domain part. I think your code would allow two different character sets for two different domain parts.
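A rough sketch of that variant, where the first character set that fits the whole hostname wins, might look like the code below. The names allLabelsMatch(), isValidIdnHostname() and the $allowIdn parameter are hypothetical, the character sets are small illustrative subsets, and nothing here is taken from the actual Zend_Validate_Hostname implementation.

```php
<?php
// Hypothetical helper: true if every label fits ASCII plus the extra characters.
function allLabelsMatch(array $labels, $extraChars)
{
    $pattern = '/^[a-z0-9\-' . $extraChars . ']{1,63}$/iu';
    foreach ($labels as $label) {
        if (preg_match($pattern, $label) !== 1) {
            return false;
        }
    }
    return true;
}

// Hypothetical sketch: the whole hostname must fit a single character set.
function isValidIdnHostname($hostname, array $charSets, $allowIdn = true)
{
    $labels = explode('.', $hostname);

    // No special characters at all: the IDN steps are skipped.
    if (allLabelsMatch($labels, '')) {
        return true;
    }
    if (!$allowIdn) {
        return false;
    }

    // Try each allowed set in turn until one fits every label.
    foreach ($charSets as $chars) {
        if (allLabelsMatch($labels, $chars)) {
            return true;
        }
    }
    return false;
}

// Illustrative subsets only.
$sets = array(
    '\x{00F3}\x{0105}\x{0142}\x{017C}', // a few Polish letters
    '\x{0430}-\x{045F}',                // Cyrillic range from the list above
);

var_dump(isValidIdnHostname('www.żółw.pl', $sets));         // bool(true)
var_dump(isValidIdnHostname('żółw.домен.pl', $sets));       // bool(false)
var_dump(isValidIdnHostname('www.żółw.pl', $sets, false));  // bool(false)
```

Because each set is tested against all labels at once, a hostname that mixes sets across its parts fails every pass and is rejected.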
I probably also need to look into whether sub-domains (e.g. www, dev, etc.) can include international characters at all, since if not this process is simplified and we can just check the main "domain name" part (for example "domain" in "www.domain.com") against these character sets.
However, another implementation I saw on the web (http://tldchk.berlios.de/) groups some character sets into external files which are then included. This suggests it's possible to share character sets across some IDN domains. I'll check into this first since if true this may make it easier to maintain and cope with domains that accept large character sets.
Posted by Wil Sinclair (wil) on 2008-04-18T13:12:04.000+0000
This doesn't appear to have been fixed in 1.5.0. Please update if this is not correct.
Posted by Simon R Jones (studio24) on 2008-04-24T11:50:54.000+0000
No, this has not been fixed. Apologies, I had a baby (at least my wife did!) before Christmas and it's taken my life over somewhat. I will take a look at this issue and will post back a more helpful response within a week.
Posted by Thomas Weidner (thomas) on 2009-03-21T14:58:32.000+0000
New feature implemented with the last rework of Zend_Validate_Hostname.