ZF-1641: PCRE UTF-8 Support Unavailable on Some Platforms

Description

Filter and validation classes such as Zend_Filter_Alpha are known to make use of the {{/u}} PCRE pattern modifier, but UTF-8 support is not compiled by default into various PHP distributions.

Comments

Original comment by [~cgraefe]:

I can see why this change was made, but for me it is causing problems on certain platforms. On some, the extendes PCRE syntax "\p{}" doesn't seem to match anything. I could quite figure out what exactly is causing this problem. Up to now, I tried the following platforms:

Not working: Fedora Core 5, PHP 5.1.6, PCRE 6.3 Fedora Core 6, PHP 5.1.6, PCRE 6.6

Working: Fedora 7, PHP 5.2.2, PCRE 7.0 Solaris 10 x86, PHP 5.1.5, PCRE 6.6 Debian Etch, PHP 4.4.4, PCRE 6.7

Maybe, you could amend the docs to state the exact prerequisites to get Zend_Filter_Alnum et al. to work. That would surely help me a lot.

I am also have this issue on

RHEL5, PHP 5.1.6, PCRE 6.6

I check redhat default build spec for PCRE 6.6 and --enable-utf8 is specified. At this point I am not 100% sure it is UTF-8 related but I do know since ZF1.0RC3 the filters Zend_Filter_Alnum and Zend_Filter_Alpha are not matching strings that should be valid.

These patterns continually fail regardless of the value input:

$pattern = '/[^\p{L}\p{N}' . ($this->allowWhiteSpace ? '\s' : '') . ']/u'; $pattern = '/[^\p{L}\p{N}' . ($this->allowWhiteSpace ? '\s' : '') . ']/';

So I modified the pattern to use the old alnum which works both with and without utf-8 specified. So these patterns are working:

$pattern = '/[^[:alnum:]' . ($this->allowWhiteSpace ? '\s' : '') . ']/u'; $pattern = '/[^[:alnum:]' . ($this->allowWhiteSpace ? '\s' : '') . ']/';

The Unicode property patterns (the ones beginning with {{\p}}) will not work if UTF-8 support is unavailable in PCRE.

From the www.pcre.org/pcre.txt" rel="nofollow">PCRE man pages:


UTF-8 SUPPORT

       To build PCRE with support for UTF-8 character strings, add

         --enable-utf8

       to  the  configure  command.  Of  itself, this does not make PCRE treat
       strings as UTF-8. As well as compiling PCRE with this option, you  also
       have  have to set the PCRE_UTF8 option when you call the pcre_compile()
       function.

UTF-8 AND UNICODE PROPERTY SUPPORT

       From  release  3.3,  PCRE  has  had  some support for character strings
       encoded in the UTF-8 format. For release 4.0 this was greatly  extended
       to  cover  most common requirements, and in release 5.0 additional sup-
       port for Unicode general category properties was added.

       In order process UTF-8 strings, you must build PCRE  to  include  UTF-8
       support  in  the  code,  and, in addition, you must call pcre_compile()
       with the PCRE_UTF8 option flag. When you do this, both the pattern  and
       any  subject  strings  that are matched against it are treated as UTF-8
       strings instead of just strings of bytes.

       If you compile PCRE with UTF-8 support, but do not use it at run  time,
       the  library will be a bit bigger, but the additional run time overhead
       is limited to testing the PCRE_UTF8 flag occasionally, so should not be
       very big.

       If PCRE is built with Unicode character property support (which implies
       UTF-8 support), the escape sequences \p{..}, \P{..}, and  \X  are  sup-
       ported.  The available properties that can be tested are limited to the
       general category properties such as Lu for an upper case letter  or  Nd
       for  a  decimal number, the Unicode script names such as Arabic or Han,
       and the derived properties Any and L&. A full  list  is  given  in  the
       pcrepattern documentation. Only the short names for properties are sup-
       ported. For example, \p{L} matches a letter. Its Perl synonym,  \p{Let-
       ter},  is  not  supported.   Furthermore,  in Perl, many properties may
       optionally be prefixed by "Is", for compatibility with Perl  5.6.  PCRE
       does not support this.

From testing on my platforms ( openSUSE/SUSE-OSS 10.0, 10.1, 10.2 ), I have the following results:

Up to ( but not including ) apache2-mod_php5-5.2.0-10.rpm testing against the PCRE expression in current Zend_Filter_Alnum passes.

From mod_php5 version that ships with openSUSE 10.2 ( 5.2.0-10 ) to the current available in the openSUSE build service, ( 5.2.3-29.1 ), testing against the pattern fails.

On openSUSE mod_php5-5.2.0-10 and greater is built against the system PCRE library and not the PHP bundled one, on opensuse 10.2 this is pcre-6.7-21. While patterns with UTF-8 options match against this system library outside of PHP; they do not, for whatever reason match when used inside of PHP.

Workaround, on openSUSE 10.2 an upgrade of the system PCRE to the latest available from the openSUSE build service ( pcre-7.1-16 ) will allow pattern matching inside of PHP to function as expected.

Issue resolved in r5468

Zend_Filter_Alpha and Zend_Filter_Digits also use Unicode property matching.

Sorry, I forgat about those ..... It should be fixed now. Graham Anderson or anyone else running a none UTF-8 enabled system, could you please run the unit tests for: Zend_Filter_AllTests?

I'll await your response before i close the issue.

Thanks to Julian Davchev for testing this:


Here stuff on local machine:

svn update U tests/Zend/Filter/AlphaTest.php U library/Zend/Filter/Alnum.php U library/Zend/Filter/Digits.php U library/Zend/Filter/Alpha.php Updated to revision 5469. At revision 5469.

jmut@dexter:/storage/www/frameworks/zendframework$ php tests/Zend/Filter/AllTests.php PHPUnit 3.0.0 by Sebastian Bergmann.

......................................... ......................................... ...........................

Time: 00:00

OK (109 tests)