
Key: ZF-881
Type: Task
Status: Resolved
Resolution: Fixed
Priority: Minor
Assignee: Simon R Jones
Reporter: Darby Felton
Votes: 0
Watchers: 3

Zend Framework

Zend_Validate_Hostname - UTF-8 hostnames valid?

Created: 08/Feb/07 02:23 PM   Updated: 05/Jul/07 02:43 PM   Resolved: 14/Mar/07 11:31 AM
Component/s: Zend_Validate
Affects Version/s: None
Fix Version/s: 0.9.0

Time Tracking:
Not Specified

File Attachments: 1. Hostname.tgz (6 kB)
2. HostnameTest.php (6 kB)
3. HostnameTestForm.php (4 kB)
4. HostnameTestStandalone.php (2 kB)


Description

Unit tests for Zend_Validate_Hostname contain an entry for 'bürger.de' that is considered valid (the test fails with the current default regular expressions). Either UTF-8 hostnames must be considered invalid, or we should change the default regular expressions for Zend_Validate_Hostname to allow for such hostnames as 'bürger.de' to be considered valid.



Gavin added a comment - 08/Feb/07 03:04 PM

Both of the options above would produce erroneous behavior.

The correct set of permitted characters and the correct regular expression to properly identify valid and invalid hostnames depends on the TLD.

In summary, I think supporting the precise character set (and correct regular expression) permitted by each registry is a large task. Unless we find an organization maintaining these tables of information (similar to the CLDR), I think this is currently beyond our scope for ZF 1.0. Additionally, I would urge anyone creating an international website to consider the implications for foreigners regarding spoofing concerns, where visually identical (or almost indistinguishable) domain names might exist.

For example, consider the supported characters for the ".de" TLD:
http://www.denic.de/en/domains/idns/liste.html

More information on IDNA:
http://en.wikipedia.org/wiki/Internationalized_domain_name

Before IDNA was adopted, we only had the following:

http://www.icann.org/tlds/org/applications/isoc/AppendixJ.html#JE

E. Domain Name Character Set

1. Usage

This character set will be used in places where a domain name is to be specified. It does not govern the specification of internationalised domain names, which are not authorized by the current specification.

2. Definition

The rules for the format and character set of domain names are defined by the following:

dot = %x2E ; "."

alpha = %x41-5A | %x61-7A ; A-Z | a-z

digit = %x30-39 ; 0-9

dash = %x2D ; "-"

dns-char = alpha | digit | dash

id-prefix = alpha | digit

label = id-prefix [*61dns-char id-prefix]

sldn = label dot label ; not to exceed 254 characters

hostname = 1*(label dot) sldn; not to exceed 254 characters
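
As a rough illustration, the grammar above can be mapped onto a regular expression. This is only a sketch: isAsciiHostname() is a made-up helper name, not part of Zend_Validate_Hostname, and it covers the pre-IDNA ASCII rules only.

```php
<?php
// Sketch of the pre-IDNA grammar quoted above; isAsciiHostname() is a
// hypothetical helper, not part of Zend_Validate_Hostname.
function isAsciiHostname(string $host): bool
{
    // label = id-prefix [*61dns-char id-prefix]: 1 to 63 characters,
    // starting and ending with a letter or digit.
    $label = '[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?';

    // hostname = 1*(label dot) sldn, with sldn = label dot label:
    // at least three dot-separated labels in total.
    $pattern = '/^(?:' . $label . '\.)+' . $label . '\.' . $label . '$/D';

    // "not to exceed 254 characters"
    return strlen($host) <= 254 && preg_match($pattern, $host) === 1;
}

var_dump(isAsciiHostname('www.example.com'));      // bool(true)
var_dump(isAsciiHostname('www.-example.com'));     // bool(false)
var_dump(isAsciiHostname("www.b\xC3\xBCrger.de")); // bool(false), non-ASCII label
```

Read literally, the grammar requires at least three labels, so a bare 'example.com' would match sldn but not hostname.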

Cheers,
Gavin


Darby Felton added a comment - 08/Feb/07 03:16 PM

Original message by Thomas Weidner:

Well... DENIC changed the set of characters supported in hostnames some months ago...
since then hostnames like www.kettenzüge.de are also regular hostnames.

So from this view Zend_Validate_Hostname should also support those characters.
There are 92 new characters which are supported....

This does not mean that DNS names using, for example, Chinese or arbitrary UTF-8 characters are supported.
But characters like "ä ö ü ß é í á ó à è ì ò" and many more are.

Here is a list of all new supported chars...
http://www.denic.de/de/domains/idns/liste.html

Greetings
Thomas
(I18N Team Leader)


Gavin added a comment - 08/Feb/07 03:22 PM

"So from this view Zend_Validate_Hostname should also support those characters."

.. only for the ".de" TLD. Also, the regular expression for this TLD would not be the same as used by ICANN (see my first comment).


Thomas Weidner added a comment - 08/Feb/07 03:55 PM

Well... ICANN defines that all Unicode characters can be supported.

2. In implementing the IDN standards, top-level domain registries will employ an "inclusion-based" approach (meaning that code points that are not explicitly permitted by the registry are prohibited) for identifying permissible code points from among the full Unicode repertoire.

3. In implementing the IDN standards, top-level domain registries will (a) associate each registered internationalized domain name with one language or set of languages, (b) employ language-specific registration and administration rules that are documented and publicly available, such as the reservation of all domain names with equivalent character variants in the languages associated with the registered domain name, and, (c) where the registry finds that the registration and administration rules for a given language would benefit from a character variants table, allow registrations in that language only when an appropriate table is available.

So the question is whether we can get a list of supported Unicode characters for each IDN registrar, or whether we should accept all Unicode characters.


Ahmed Shreef added a comment - 08/Feb/07 05:10 PM

OK, just accept any Unicode characters. I found that there are some hostnames like " http://www.جهينة.com/ ".

A few minutes ago I read an Arabic forum post about this type of domain. One of the posters said that you can get a domain like this from any domain registrar. For example, if you want to register the domain " http://www.جهينة.com/ ", just register the domain " http://www.xn--ogbf2fdp.com/ ". If the user types the Arabic domain, the browser will translate the Arabic characters into a mix of English ones and take the user to the website.

So if we can do this in Arabic, it should be possible to create Japanese and Chinese ones too.



Gavin added a comment - 08/Feb/07 07:24 PM

I expect the managers of TLDs would heavily frown on permitting illegal characters in domain names for their TLDs.

For example, using characters not approved for Icelandic in a ".is" domain would open the door to a form of spoofing attack, by allowing other characters that look similar or identical to the special characters added by the ".is" TLD to the permitted list of characters for their domain names.


Simon R Jones added a comment - 09/Feb/07 03:40 AM

We have a French member of staff here. French domains (.fr) do not use special characters; the official site indicates that only the following characters are allowed: "a-z", "0-9" and "-"

As of 2004 Belgian domains (.be) can include special characters as defined by the IDN. I am not sure how widely used this is, however.

I agree with Thomas that it is useful to support these character sets. A sensible approach would be to accept a restrictive set of characters for all domains, then extend this on a TLD-by-TLD basis. Each TLD should then allow additional character sets as defined by the relevant authority.
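
A minimal sketch of that idea, under stated assumptions: the $extraChars table and isValidLabelForTld() are hypothetical names, the ".de" entry is a small illustrative subset rather than DENIC's full list, and input is assumed to be lower-case UTF-8. Length limits are left out for brevity.

```php
<?php
// Hypothetical per-TLD table: characters allowed in addition to a-z, 0-9
// and "-". The ".de" entry is an illustrative subset, not DENIC's full list.
$extraChars = [
    'de' => '\x{00E4}\x{00F6}\x{00FC}', // ä ö ü
    'fr' => '',                         // ASCII only, per the comment above
];

function isValidLabelForTld(string $label, string $tld, array $extraChars): bool
{
    // Unknown TLDs fall back to the restrictive base set.
    $extra = $extraChars[strtolower($tld)] ?? '';
    $chars = 'a-z0-9' . $extra;

    // First and last character must not be "-"; /u makes PCRE treat both
    // pattern and subject as UTF-8.
    $pattern = '/^[' . $chars . '](?:[' . $chars . '-]*[' . $chars . '])?$/uD';

    return preg_match($pattern, $label) === 1;
}

$label = "b\xC3\xBCrger"; // "bürger" spelled out as UTF-8 bytes

var_dump(isValidLabelForTld($label, 'de', $extraChars)); // bool(true)
var_dump(isValidLabelForTld($label, 'fr', $extraChars)); // bool(false)
```

Keeping the extra characters as data means a user could add another TLD by supplying one more array entry, without touching the matching code.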

I don't see how we can hope to cover all domains, but at least a core number can be achieved and as long as users can extend the system to add their own characters for other domains that should be workable.

Of course all of this is different for local hostnames. I think there was talk that the Hostname validator should allow local name validations. Is this right?


Thomas Weidner added a comment - 09/Feb/07 12:34 PM

Well, for DE such a list exists...

I am sure that also for other IDNs such a list can be found.
So if we don't accept all Unicode characters, we should only include the IDNs for which we have the extended charset.

Others would only have ToAscii instead of ToUnicode.

@Ahmed:
What you've mentioned is not what is meant here.

Not all IDNs accept Unicode.
http://www.xn--ogbf2fdp.com/ is only what an IDN produces after receiving Unicode characters.

http://www.جهينة.com/ is the Unicode form which we have to parse/validate.
The Unicode letters are then changed by the ToAscii function in the browser/IDN to ASCII letters, where xn-- (and also others) are defining which Unicode charset to use.

But an Arabic IDN would not accept xd-- because it's not Arabic.
Look at the links which Gavin placed for further explanation.


Simon R Jones added a comment - 11/Feb/07 04:56 PM

My suggested approach (which I am now testing) involves having a more basic regex (as per the IANA spec mentioned above) for standard domains which matches up with that specified at iana.org

I will also match against a known TLD for DNS-based domains. If a special regex exists for a TLD, then this is used. I'll set up .de and .be as a test to start with

I will have a large array for the known TLDs (generated from ftp://data.iana.org/TLD/tlds-alpha-by-domain.txt) and custom regexes for each TLD that we set up for. I will store these as properties in Zend_Validate_Hostname.

I did think about separating them out into different files in something like Zend_Validate_Hostname/Data, but the benefit of such organisation seems outweighed by the speed hit it would incur and the fact that the data will be so small.

I have a script to generate the TLD array. Should this be stored somewhere within the ZF folder structure?
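
For illustration only, such a generator might look roughly like this. parseTldList() is a hypothetical name, the sample text is made up, and the file format (one TLD per line, "#" comment lines) is an assumption about the IANA list rather than something stated in this thread.

```php
<?php
// Sketch: turn the contents of a tlds-alpha-by-domain.txt style file into a
// PHP array. The format (one TLD per line, "#" comments) is an assumption.
function parseTldList(string $text): array
{
    $tlds = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if ($line === '' || $line[0] === '#') {
            continue; // skip blank lines and comments
        }
        $tlds[] = strtolower($line);
    }
    return $tlds;
}

// Made-up sample input standing in for the downloaded file.
$sample = "# sample list\nAT\nBE\nCH\nDE\n";
$tlds = parseTldList($sample);

// Emit the array as PHP source, ready to paste into a class property.
echo '$validTlds = ' . var_export($tlds, true) . ";\n";
```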

Darby, I'll probably have to drop your regex constant organisation since the DNS regex will be too complex to fit into one line and will require further checks (i.e. against the TLD). Hope that's ok!


Simon R Jones added a comment - 12/Feb/07 03:39 PM

I've added support for the following countries: Austria (.AT), Switzerland (.CH), Liechtenstein (.LI), Germany (.DE), Finland (.FI), Hungary (.HU), Norway (.NO), and Sweden (.SE). Seems Belgium doesn't accept IDN at present.

So far I've placed the additional characters regex in Zend_Validate_Hostname but this isn't going to work for Far East languages such as Japan (which has 6534 additional characters available for IDN domains according to http://www.iana.org/assignments/idn/jp-japanese.html ). May have to rethink how the country regexes are stored.

Any objections to storing these in a Zend_Validate_Hostname/Data folder as straightforward PHP arrays? I could encapsulate these in a class such as Zend_Validate_Hostname_Data_Jp if this is considered better practise.

I also notice exceptions are raised in the current Zend_Validate_Hostname, seemingly to detect whether a preg_match fails. This seems different from all other Validator classes and gets a bit messy now I've extended the validation somewhat for domains. Is it OK to just set messages wherever there's a failure?

Any input appreciated, otherwise I'll go ahead and commit these changes in the next day or two.


Gavin added a comment - 12/Feb/07 04:37 PM

"I have a script to generate the TLD array. Should this be stored somewhere within the ZF folder structure?"

Yes, see the build-tools/ directory.

Eventually, some will use configuration scripts for installing the ZF and/or ZF apps.
This will some day affect how developers and users want to deal with the bloat associated with supporting all locales, TLDs, and other optional code in the ZF.

Ideally, using isHostname() would not force all TLD regexes into RAM.
I personally like the efficient approach of storing appropriate data into PHP arrays in *.php files, ready for inclusion, if needed.
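
One way this could look, as a sketch under assumptions: loadTldCharacters() and the one-file-per-TLD layout (De.php returning an array) are hypothetical, and the demo writes a throwaway data file to the system temp directory so the example is self-contained.

```php
<?php
// Hypothetical lazy loader: each TLD's extra characters live in a separate
// PHP file that returns an array, and a file is only read on first use.
function loadTldCharacters(string $tld, string $dataDir): ?array
{
    static $cache = [];

    $tld = strtolower($tld);
    if (preg_match('/^[a-z]{2,63}$/D', $tld) !== 1) {
        return null; // never build a file path from unchecked input
    }

    $file = $dataDir . '/' . ucfirst($tld) . '.php';
    if (!array_key_exists($file, $cache)) {
        $cache[$file] = is_file($file) ? (include $file) : null;
    }
    return $cache[$file];
}

// Demo: write a throwaway data file, then load it on demand.
$dir = sys_get_temp_dir() . '/tld-data-' . uniqid();
mkdir($dir);
file_put_contents($dir . '/De.php', '<?php return ["\x{00E4}", "\x{00F6}", "\x{00FC}"];');

var_dump(loadTldCharacters('de', $dir)); // the three entries from De.php
var_dump(loadTldCharacters('fr', $dir)); // NULL, no data file for .fr
```

Only the TLDs actually validated during a request are ever read from disk, which is the memory saving described above.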


Simon R Jones added a comment - 13/Feb/07 05:17 PM

I'm having an issue matching UTF-8 characters within a preg_match using hex characters. From the docs, \x{XXXX} should match a UTF-8 character; however, the following two tests return false (when they should return true):

var_dump(preg_match("/\x{00E4}/u", 'ä'));

var_dump(preg_match("/^[a-zA-Z0-9\x{00E0}\x{00E1}\x{00E2}\x{00E3}\x{00E4}\x{00E5}\x{00E6}\x{00E7}\x{00E8}\x{00E9}\x{00EA}\x{00EB}\x{00EC}\x{00ED}\x{00EE}\x{00EF}\x{00F0}\x{00F1}\x{00F2}\x{00F3}\x{00F4}\x{00F5}\x{00F6}\x{00F8}\x{00F9}\x{00FA}\x{00FB}\x{00FC}\x{00FD}\x{00FE}\x{00FF}\x{0153}\x{0161}\x{017E}]{1,63}$/u", "bächer"));

If I remove the /u modifier the first example works, but the second fails on the preg_match with a warning: preg_match(): Compilation failed: character value in \x{...} sequence is too large at offset 266 in C:\wamp\www\zf-test\mb_test.php on line 7

This seems to indicate the character \x{0153} but that is valid

Anyone have experience in this?
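
One possible explanation, offered as an assumption rather than a confirmed diagnosis: with the /u modifier PCRE requires the subject string to be valid UTF-8, so if the script file is saved as ISO-8859-1 or Windows-1252, the literal 'ä' is the single byte 0xE4 and preg_match() returns false. Without /u, \x{...} values above 0xFF are rejected, which would account for the warning about \x{0153}. The sketch below spells the bytes out explicitly so that the result does not depend on the file's encoding.

```php
<?php
$pattern = '/\x{00E4}/u';

$utf8   = "\xC3\xA4"; // "ä" encoded as UTF-8 (two bytes)
$latin1 = "\xE4";     // "ä" encoded as ISO-8859-1 (one byte, invalid as UTF-8)

var_dump(preg_match($pattern, $utf8));               // int(1)
var_dump(preg_match($pattern, $latin1));             // bool(false)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```

Checking preg_last_error() after a false return distinguishes "invalid UTF-8 subject" from a genuine pattern problem.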


Simon R Jones added a comment - 13/Feb/07 05:19 PM

sorry about the long line length, if someone can edit the code example to go across multiple lines that would be good


Simon R Jones added a comment - 15/Feb/07 08:37 AM

Revision 3429, which will be released with 0.8, has a rewrite of DNS hostname matching which is accurate for all normal domain names including a check for a current valid TLD.

I have developed test scripts for IDN support for a number of country TLDs but due to the complexity of testing UTF-8 characters this isn't ready for release at this time. I will continue testing with the aim to have this complete for 0.9

So far I have placed arrays for TLD additional characters in separate files within Zend/Validate/Hostname/ with files named in the format De.php and the class name Zend_Validate_Hostname_De - these classes are based on an interface which describes their usage.

IDN characters are stored as 4-char hex codes, ie

\x{00E3}

While this is fine for regexes, I've had problems including UTF-8 characters in the Unit tests. I believe this to be to do with the Byte-Order-Mark of the actual file. Zend_Validate_Hostname will likely be used in the real world for validating data from POST or a database such as MySQL so BOMs may be less important. I will be undertaking some tests to see what needs to be done to ensure real-world IDN domain name matching works with UTF-8 characters.

I've changed the default behaviour of Zend_Validate_Hostname to only match DNS hostnames since after discussion with my co-developers in the office I believe this is the most common practical usage of Zend_Validate_Hostname. Other options can be set as normal via the constructor. The ability to change the regex for local domains has been kept, but this may be deprecated by 0.9

Usage documentation for the current version has been added to the end-user manual.


Philip Iezzi added a comment - 25/Feb/07 05:16 AM

Why should we take the complexity of validating UTF-8 characters in Zend_Validate_Hostname?
Doesn't it make more sense to just allow punycoded IDNA hostnames? I guess this regex would be much easier to implement.

'kettenzüge.de' would result in 'xn--kettenzge-w9a.de' which currently does not validate in Zend_Validate_Hostname.
To make such a hostname available for other applications (e.g. zonefiles in Bind or Apache's VirtualHosts) anyway we need to convert them to punycode. Why not just do this prior to validation to simplify life?
I'm using PEAR's Net_IDNA, e.g.:

$idna = Net_IDNA::getInstance();
$encodedHostname = $idna->encode(utf8_encode($hostname));

$validator = new Zend_Validate_Hostname();
if (!$validator->isValid($encodedHostname)) {
    // would be nice if this would validate!
}
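
For comparison, the same conversion can be sketched with PHP's intl extension instead of PEAR. This assumes intl is installed; idn_to_ascii() is not something this thread says Zend_Validate_Hostname relies on, and toPunycode() is a hypothetical wrapper name.

```php
<?php
// Assumes the intl extension; returns null when it is missing or when the
// conversion fails. toPunycode() is a hypothetical helper.
function toPunycode(string $utf8Hostname): ?string
{
    if (!function_exists('idn_to_ascii')) {
        return null;
    }
    $ascii = idn_to_ascii($utf8Hostname, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    return $ascii === false ? null : $ascii;
}

// "bürger.de" spelled out as UTF-8 bytes; with intl available the result is
// an ASCII hostname whose first label starts with "xn--".
var_dump(toPunycode("b\xC3\xBCrger.de"));
```

The encoded form contains only letters, digits and hyphens, so it could then be checked with a plain ASCII hostname rule.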

Simon R Jones added a comment - 13/Mar/07 02:21 PM

I have added IDN domain support to Zend_Validate_Hostname to match a domain such as bürger.de

However I've had some real problems getting the actual Unit Tests to run and validate an IDN domain. Most of these seem to be due to UTF-8 characters in the test file getting mucked up (encoding is great fun!).

So far I cannot actually run my Unit Tests successfully (I am on Windows with WAMP and so far have not managed to get PHPUnit running on a Unix box here) but if I copy the code into standalone test scripts using the actual Zend Framework validate functions they pass fine. Test scripts which validate a hostname via POST also work fine (given form encoding is working OK).

I would like someone with more experience of PHPUnit to run my tests to see if they pass OK. If so, I'll submit the code into the repository and hopefully all will be well.

Otherwise is it OK to commit this new feature without a Unit Test? I rather suspect even if we can sort out encoding in one file Subversion may muck up the encoding and thus make it impossible to get a stable Unit test for this in the trunk.


Simon R Jones added a comment - 13/Mar/07 03:31 PM

sorry, forgot - Hostname unit test file attached. This file should be encoded as UTF-8


Simon R Jones added a comment - 13/Mar/07 03:56 PM

Final comment today.. I am using mb_strtolower() to reliably force hostnames to lower-case for matching against IDN characters.

Looking at comments on ZF-269 this seems not to be the ideal way to do things? If this is so, I can look at adding all the upper-case versions of the allowed characters though it may take a while..


Thomas Weidner added a comment - 13/Mar/07 05:17 PM

Only one small point:

Why do you use the mb extension?
I thought that it was deprecated and that iconv should be used instead?

Btw:
Instead of Zend_Locale_UTF8 (ZF-269) we use the iconv library for localization.


Simon R Jones added a comment - 14/Mar/07 03:55 AM

Thomas: can I convert to lower case with iconv? From looking over the manual I can't find a solution to that with iconv


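Thomas Weidner added a comment - 14/Mar/07 05:03 AM

// iconv() takes (from-encoding, to-encoding, string), so the string goes last.
// Note: strtolower() only lower-cases non-ASCII bytes under a matching locale
// (and, as of PHP 8.2, only ASCII), so this is not a general solution.
function iconv_strtolower($value)
{
    return iconv("windows-1251", "UTF-8", strtolower(iconv("UTF-8", "windows-1251", $value)));
}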

So simple that sometimes you don't see the wood for the trees


Thomas Weidner added a comment - 14/Mar/07 05:08 AM

For Linux, another encoding should be selected, of course.


Simon R Jones added a comment - 14/Mar/07 06:27 AM

Thanks Thomas!

Am testing with the i modifier in preg which seems to be UTF-8 safe if I'm in UTF-8 mode


Simon R Jones added a comment - 14/Mar/07 06:43 AM

Silly me, using the i and u modifier works fine for UTF-8 safe lower-casing regexes. Thanks for the advice though Thomas
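
A small sketch of that behaviour, assuming a PCRE build with Unicode support (which PHP normally bundles): the character class lists lower-case forms only, and the i and u modifiers together let an upper-case "Ä" match. The character subset is illustrative.

```php
<?php
// Character class holds lower-case forms only; /i combined with /u lets
// PCRE fold case for non-ASCII letters as well.
$chars = 'a-z0-9\x{00E4}\x{00F6}\x{00FC}'; // illustrative subset: ä ö ü
$caseInsensitive = '/^[' . $chars . ']+$/iuD';
$caseSensitive   = '/^[' . $chars . ']+$/uD';

$upper = "B\xC3\x84CHER"; // "BÄCHER" spelled out as UTF-8 bytes

var_dump(preg_match($caseInsensitive, $upper)); // int(1)
var_dump(preg_match($caseSensitive, $upper));   // int(0)
```

This avoids a separate lower-casing step such as mb_strtolower(), at the cost of depending on the /u modifier working in the target environment.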

I have had to create two standalone test scripts in the tests/Zend/Validate folder since I can't get the Unit Test to work with IDN characters.

The first is tests/Zend/Validate/HostnameTestStandalone.php which is designed to be run on the command line.

The second is tests/Zend/Validate/HostnameTestForm.php which is designed to be run via HTML to allow users to test entering UTF-8 characters in a form.

Is it OK to commit these to SVN or is this not really the place for non-unit tests?

Finally, I had some issues trying to filter incoming POST data with UTF-8 characters. Bar stripping tags, any advice on how to make use of Zend_Filter on wider character sets?


Simon R Jones added a comment - 14/Mar/07 11:31 AM

Fixed in revision 3927

IDN support is described in Zend_Validate_Hostname and Zend_Validate_Hostname_Interface. Documentation to be updated shortly.

Initially supported TLDs are: .at, .ch, .li, .de, .fi, .hu, .no, .se

I have a script to create a .jp character list but since it weighs in over 6000 characters I want to test this on smaller character set IDNs as mentioned above


Simon R Jones added a comment - 14/Mar/07 11:36 AM

In case of any issues with testing the Unit test for Hostname, please try this standalone script - HostnameTestStandalone.php


Simon R Jones added a comment - 14/Mar/07 11:37 AM

In case of any issues with testing the Unit test for Hostname, please try this standalone script intended for testing IDN domains via a form - HostnameTestForm.php


Gavin added a comment - 27/Mar/07 01:31 PM

Summarizing from historical threads:

  • the mbstring extension is not deprecated and is the recommended way to fully UTF-8 enable an entire PHP application
  • the mbstring extension should not be required in ZF /library core code (the extension is not a ZF requirement)
  • the /u modifier for PCRE does not work in all conditions and situations (see past topic threads and pcre.org for details)
  • test suites might make use of more "tools" (e.g. mbstring) than are available in ZF /library core, but the tests then need to be optional