Original message

Well... DENIC changed the set of characters supported in hostnames some months ago, so from this point of view Zend_Validate_Hostname should also support those characters. As it stands, DNS names that use, for example, Chinese or other UTF-8 characters are not supported. Here is a list of all newly supported characters...

Greetings

Well... ICANN defines that all Unicode characters can be supported.

2. In implementing the IDN standards, top-level domain registries will employ an "inclusion-based" approach (meaning that code points that are not explicitly permitted by the registry are prohibited) for identifying permissible code points from among the full Unicode repertoire.

3. In implementing the IDN standards, top-level domain registries will (a) associate each registered internationalized domain name with one language or set of languages, (b) employ language-specific registration and administration rules that are documented and publicly available, such as the reservation of all domain names with equivalent character variants in the languages associated with the registered domain name, and, (c) where the registry finds that the registration and administration rules for a given language would benefit from a character variants table, allow registrations in that language only when an appropriate table is available.

So the question is whether we get a list of supported Unicode characters for each IDN registrar, or whether we should accept all Unicode characters.

OK, just accept any Unicode characters. I found that there are some hostnames like " http://www

I read an Arabic forum post about this type of domain a few minutes ago. One of the participants said that you can get a domain like this from any domain registrar. For example, if you want to register the domain " http://www

So if we can do this in Arabic, it should be possible to create Japanese and Chinese ones too.

I expect the managers of TLDs would heavily frown on permitting illegal characters in domain names for their TLDs.
For example, using characters that are not approved for Icelandic in a ".is" domain would open the door to a form of spoofing attack, by allowing other characters that look similar or identical to the special characters added by the ".is" TLD to the permitted list of characters for their domain names.

We have a French member of staff here. French domains (.fr) do not use special characters; the official site indicates that only the following characters are allowed: "a-z", "0-9" and "-". As of 2004, Belgian domains (.be) can include special characters as defined by the IDN. I am not sure how widely used this is, however. I agree with Thomas that it is useful to support these character sets. A sensible approach would be to accept a restrictive set of characters for all domains, then extend this on a TLD-by-TLD basis. Each TLD should then allow additional character sets as defined by the relevant authority. I don't see how we can hope to cover all domains, but at least a core number can be achieved, and as long as users can extend the system to add their own characters for other domains, that should be workable. Of course, all of this is different for local hostnames. I think there was talk that the Hostname validator should allow local name validation. Is this right?

Well, for DE such a list exists... I am sure that such a list can also be found for other IDNs. Others would only have ToAscii instead of ToUnicode. @Ahmed: Not all IDNs accept Unicode. http://www But an Arabic IDN would not accept xd-- because it's not Arabic.

My suggested approach (which I am now testing) involves having a more basic regex (as per the IANA spec mentioned above) for standard domains, which matches up with that specified at iana.org. I will also match against a known TLD for DNS-based domains. If a special regex exists for a TLD, then this is used.
I'll set up .de and .be as a test to start with. I will have a large array for the known TLDs (generated from ftp://data.iana.org/TLD/tlds-alpha-by-domain.txt). I did think about separating them out into different files in something like Zend_Validate_Hostname/Data, but the benefit of such organisation seems outweighed by the speed hit it would incur and by the fact that the data will be so small. I have a script to generate the TLD array. Should this be stored somewhere within the ZF folder structure? Darby, I'll probably have to drop your regex constant organisation since the DNS regex will be too complex to fit into one line and will require further checks (i.e. against the TLD). Hope that's OK!

I've added support for the following countries: Austria (.AT), Switzerland (.CH), Liechtenstein (.LI), Germany (.DE), Finland (.FI), Hungary (.HU), Norway (.NO), and Sweden (.SE). It seems Belgium doesn't accept IDN at present. So far I've placed the additional characters regex in Zend_Validate_Hostname, but this isn't going to work for Far East languages such as Japanese (which has 6534 additional characters available for IDN domains according to http://www.iana.org/assignments/idn/jp-japanese.html). Any objections to storing these in a Zend_Validate_Hostname/Data folder as straightforward PHP arrays? I could encapsulate these in a class such as Zend_Validate_Hostname_Data_Jp if this is considered better practice. I also notice that exceptions are raised in the current Zend_Validate_Hostname, seemingly to detect whether a preg_match fails. This differs from all other Validator classes and gets a bit messy now that I've extended the validation somewhat for domains. Is it OK to just set messages wherever there's a failure? Any input appreciated, otherwise I'll go ahead and commit these changes in the next day or two.
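The approach discussed so far (a basic pattern for every domain, a check against a list of known TLDs, and extra characters permitted per TLD) could be sketched roughly as follows. This is only an illustration and not the code committed to Zend_Validate_Hostname: the function name `isValidDnsHostname`, the variables `$knownTlds` and `$idnExtraChars`, and the very short character lists are all assumptions made for the example, and the 63-character label limit is applied to the Unicode form as a simplification.

```php
<?php
// Hypothetical sketch; names and data are illustrative, not the ZF implementation.

// Extra code points allowed per TLD, written as PCRE \x{XXXX} escapes.
// These lists are deliberately tiny; real tables come from each registry.
$idnExtraChars = [
    'de' => '\x{00E4}\x{00F6}\x{00FC}', // ä ö ü
    'se' => '\x{00E5}\x{00E4}\x{00F6}', // å ä ö
];

// A handful of known TLDs; the thread generates the full list from the IANA file.
$knownTlds = ['com', 'org', 'de', 'se', 'fr'];

function isValidDnsHostname(string $hostname, array $knownTlds, array $idnExtraChars): bool
{
    $labels = explode('.', $hostname);
    if (count($labels) < 2) {
        return false; // need at least "name.tld"
    }

    // Step 1: the last label must be a known TLD.
    $tld = strtolower(array_pop($labels));
    if (!in_array($tld, $knownTlds, true)) {
        return false;
    }

    // Step 2: build the label pattern from the basic set plus any TLD extras.
    $chars = 'a-z0-9' . ($idnExtraChars[$tld] ?? '');
    $pattern = '/^[' . $chars . '](?:[' . $chars . '\-]{0,61}[' . $chars . '])?$/iu';

    // Step 3: every remaining label must match (no leading or trailing hyphen).
    foreach ($labels as $label) {
        if (preg_match($pattern, $label) !== 1) {
            return false;
        }
    }
    return true;
}

// "\u{00FC}" (PHP 7+) yields a UTF-8 "ü" regardless of the source file encoding.
var_dump(isValidDnsHostname("b\u{00FC}rger.de", $knownTlds, $idnExtraChars));
var_dump(isValidDnsHostname("b\u{00FC}rger.fr", $knownTlds, $idnExtraChars));
```

Keeping the per-TLD data in a plain array keyed by TLD means only the entry for the TLD being validated is ever turned into a regex, which is one way to address the concern about large tables such as the Japanese one.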
Yes, see the build-tools/ directory. Eventually, some will use configuration scripts for installing the ZF and/or ZF apps. Ideally, using isHostname() would not force all TLD regexes into RAM.

I'm having an issue matching UTF-8 characters within a preg_match using hex characters. From the docs, \x{XXXX} should match a UTF-8 character; however, the following two tests return false (when they should return true):

    var_dump(preg_match("/\x{00E4}/u", 'ä'));
    var_dump(preg_match("/^[a-zA-Z0-9\x{00E0}\x{00E1}\x{00E2}\x{00E3}\x{00E4}\x{00E5}\x{00E6}\x{00E7}\x{00E8}\x{00E9}\x{00EA}\x{00EB}\x{00EC}\x{00ED}\x{00EE}\x{00EF}\x{00F0}\x{00F1}\x{00F2}\x{00F3}\x{00F4}\x{00F5}\x{00F6}\x{00F8}\x{00F9}\x{00FA}\x{00FB}\x{00FC}\x{00FD}\x{00FE}\x{00FF}\x{0153}\x{0161}\x{017E}]{1,63}$/u", "bächer"));

If I remove the /u modifier the first example works, but the second fails on the preg_match with a warning:

    preg_match(): Compilation failed: character value in \x{...} sequence is too large at offset 266 in C:\wamp\www\zf-test\mb_test.php on line 7

This seems to indicate the character \x{0153}, but that is valid. Does anyone have experience with this?

Sorry about the long line length; if someone can edit the code example to go across multiple lines, that would be good.

Revision 3429, which will be released with 0.8, has a rewrite of DNS hostname matching which is accurate for all normal domain names, including a check for a current valid TLD. I have developed test scripts for IDN support for a number of country TLDs, but due to the complexity of testing UTF-8 characters this isn't ready for release at this time. I will continue testing with the aim of having this complete for 0.9. So far I have placed arrays for TLD additional characters in separate files within Zend/Validate/Hostname/, with files named in the format De.php and the class name Zend_Validate_Hostname_De. These classes are based on an interface which describes their usage. IDN characters are stored as 4-char hex codes, i.e. \x{00E3}.
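For reference, a small sketch of how \x{XXXX} behaves with the /u modifier. One plausible explanation for the failing tests reported above, which is an assumption since the thread does not confirm it, is that the test file was saved in a single-byte encoding, so the subject string was not valid UTF-8. With /u, preg_match() returns false for such a subject rather than 0. Without /u, PCRE works byte-wise and rejects any \x{...} value above FF at compile time, which matches the quoted warning about \x{0153}. The variable names below are illustrative only.

```php
<?php
// Build the subjects from escapes so they do not depend on this file's encoding.
$utf8Subject   = "b\u{00E4}cher"; // "bächer" encoded as UTF-8 (PHP 7+ escape)
$latin1Subject = "b\xE4cher";     // the same word as single-byte ISO-8859-1

// Single quotes pass \x{....} through to PCRE untouched.
$pattern = '/^[a-z\x{00E4}\x{0153}]{1,63}$/u';

// Valid UTF-8 subject: the code point escape matches as expected.
$utf8Result = preg_match($pattern, $utf8Subject);

// Invalid UTF-8 subject: preg_match() gives up and returns false.
$latin1Result = preg_match($pattern, $latin1Subject);
$latin1Error  = preg_last_error(); // PREG_BAD_UTF8_ERROR in this case

var_dump($utf8Result, $latin1Result, $latin1Error === PREG_BAD_UTF8_ERROR);
```

If this is indeed the cause, checking preg_last_error() after a false result is a cheap way to tell an encoding problem apart from a genuine non-match.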
While storing hex codes is fine for regexes, I've had problems including UTF-8 characters in the Unit tests. I believe this has to do with the Byte-Order-Mark of the actual file. Zend_Validate_Hostname will likely be used in the real world for validating data from POST or a database such as MySQL, so BOMs may be less important. I will be undertaking some tests to see what needs to be done to ensure real-world IDN domain name matching works with UTF-8 characters. I've changed the default behaviour of Zend_Validate_Hostname to only match DNS hostnames since, after discussion with my co-developers in the office, I believe this is the most common practical usage of Zend_Validate_Hostname. Other options can be set as normal via the constructor. The ability to change the regex for local domains has been kept, but this may be deprecated by 0.9. Usage documentation for the current version has been added to the end-user manual.

Why should we take on the complexity of validating UTF-8 characters in Zend_Validate_Hostname? 'kettenzüge.de' would result in 'xn--kettenzge-w9a.de', which currently does not validate in Zend_Validate_Hostname.

    $idna = Net_IDNA::getInstance();
    $encodedHostname = $idna->encode(utf8_encode($hostname));
    $validator = new Zend_Validate_Hostname();
    if (!$validator->isValid($encodedHostname)) {
        // would be nice if this would validate!
    }

I have added IDN domain support to Zend_Validate_Hostname to match a domain such as bürger.de. However, I've had some real problems getting the actual Unit Tests to run and validate an IDN domain. Most of these seem to be due to UTF-8 characters in the test file getting mucked up (encoding is great fun!). So far I cannot actually run my Unit Tests successfully (I am on Windows with WAMP and so far have not managed to get PHPUnit running on a Unix box here), but if I copy the code into standalone test scripts using the actual Zend Framework validate functions they pass fine.
Test scripts which validate a hostname via POST also work fine (given form encoding is working OK). I would like someone with more experience of PHPUnit to run my tests to see if they pass OK. If so, I'll submit the code into the repository and hopefully all will be well. Otherwise, is it OK to commit this new feature without a Unit Test? I rather suspect that even if we can sort out encoding in one file, Subversion may muck up the encoding and thus make it impossible to get a stable Unit test for this in the trunk.

Sorry, forgot - Hostname unit test file attached. This file should be encoded as UTF-8.

Final comment today... I am using mb_strtolower() to reliably force hostnames to lower-case for matching against IDN characters.

Looking at the comments, only one small point: why do you use the mb extension?

Btw, Thomas: can I convert to lower case with iconv? From looking over the manual I can't find a solution to that with iconv.

    public function iconv_strtolower($value)
    {
        // iconv() takes (from_encoding, to_encoding, string)
        return iconv("windows-1251", "UTF-8", strtolower(iconv("UTF-8", "windows-1251", $value)));
    }

So simple that sometimes you can't see the wood for the trees. On Linux another encoding should be selected, of course.

Thanks Thomas! I am testing with the i modifier in preg, which seems to be UTF-8 safe if I'm in UTF-8 mode.

Silly me, using the i and u modifiers works fine for UTF-8 safe lower-casing regexes. Thanks for the advice though, Thomas.

I have had to create two standalone test scripts in the tests/Zend/Validate folder since I can't get the Unit Test to work with IDN characters. The first is tests/Zend/Validate/HostnameTestStandalone.php, which is designed to be run on the command line. The second is tests/Zend/Validate/HostnameTestForm.php, which is designed to be run via HTML to allow users to test entering UTF-8 characters in a form. Is it OK to commit these to SVN, or is this not really the place for non-unit tests? Finally, I had some issues trying to filter incoming POST data with UTF-8 characters.
Bar stripping tags, any advice on how to make use of Zend_Filter on wider character sets?

Fixed in version 3927. IDN support is described in Zend_Validate_Hostname and Zend_Validate_Hostname_Interface. Documentation to be updated shortly. Initially supported TLDs are: .at, .ch, .li, .de, .fi, .hu, .no, .se. I have a script to create a .jp character list, but since it weighs in at over 6000 characters I want to test this on the smaller character set IDNs mentioned above first.

In case of any issues with testing the Unit test for Hostname, please try this standalone script - HostnameTestStandalone.php

In case of any issues with testing the Unit test for Hostname, please try this standalone script intended for testing IDN domains via a form - HostnameTestForm.php

Summarizing from historical threads:
Both of the options above would produce erroneous behavior.
The correct set of permitted characters and the correct regular expression to properly identify valid and invalid hostnames depends on the TLD.
In summary, I think supporting the precise character set (and correct regular expression) permitted by each registry is a large task. Unless we find an organization maintaining these tables of information (similar to the CLDR), I think this is currently beyond our scope for ZF 1.0. Additionally, I would urge anyone creating an international website to consider the implications for foreigners regarding spoofing concerns, where visually identical (or almost indistinguishable) domain names might exist.
For example, consider the supported characters for the ".de" TLD:
http://www.denic.de/en/domains/idns/liste.html
More information on IDNA:
http://en.wikipedia.org/wiki/Internationalized_domain_name
Before IDNA was adopted, we only had the following:
http://www.icann.org/tlds/org/applications/isoc/AppendixJ.html#JE
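To make the spoofing concern concrete, here is a minimal sketch of two hostnames that can look identical on screen but differ in code points, and of how an inclusion-based check (only code points explicitly permitted for the TLD are accepted) tells them apart. The function name `usesOnlyPermitted`, the hostnames, and the short ".de" character list are assumptions made for this illustration, not registry data.

```php
<?php
// Hypothetical example data: the second hostname swaps Latin "a" (U+0061)
// for CYRILLIC SMALL LETTER A (U+0430), which many fonts render identically.
$latin = "example.de";
$mixed = "ex\u{0430}mple.de";

// Inclusion-based check: every character must be in the permitted class
// (dots and hyphens are allowed here as separators for simplicity).
function usesOnlyPermitted(string $hostname, string $permittedClass): bool
{
    return preg_match('/^[' . $permittedClass . '.\-]+$/u', $hostname) === 1;
}

// Illustrative subset of characters permitted for ".de".
$dePermitted = 'a-z0-9\x{00E4}\x{00F6}\x{00FC}';

var_dump($latin === $mixed);                      // the strings differ byte-wise
var_dump(usesOnlyPermitted($latin, $dePermitted)); // all characters permitted
var_dump(usesOnlyPermitted($mixed, $dePermitted)); // Cyrillic letter is rejected
```

Accepting any Unicode character would let the second name through, which is why the permitted set has to come from the registry for each TLD.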
Cheers,
Gavin