UTF-8 SUPPORT
To build PCRE with support for UTF-8 character strings, add
--enable-utf8
to the configure command. Of itself, this does not make PCRE treat
strings as UTF-8. As well as compiling PCRE with this option, you also
have to set the PCRE_UTF8 option when you call the pcre_compile()
function.
UTF-8 AND UNICODE PROPERTY SUPPORT
From release 3.3, PCRE has had some support for character strings
encoded in the UTF-8 format. For release 4.0 this was greatly extended
to cover most common requirements, and in release 5.0 additional
support for Unicode general category properties was added.
In order to process UTF-8 strings, you must build PCRE to include UTF-8
support in the code, and, in addition, you must call pcre_compile()
with the PCRE_UTF8 option flag. When you do this, both the pattern and
any subject strings that are matched against it are treated as UTF-8
strings instead of just strings of bytes.
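To illustrate the difference between "UTF-8 strings" and "strings of bytes", here is a minimal sketch in plain C that counts characters rather than bytes. The helper name utf8_char_count is hypothetical and not part of PCRE, and the code assumes its input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: count the characters in a
 * NUL-terminated UTF-8 string. Every byte that is not a continuation
 * byte (bit pattern 10xxxxxx) starts a new character. The input is
 * assumed to be valid UTF-8; no validation is performed. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the UTF-8 encoding of "Gräfe", written in C as "Gr\xC3\xA4" "fe", this returns 5, whereas strlen() reports 6 because "ä" occupies two bytes. Without PCRE_UTF8, each of those 6 bytes is treated as a separate character.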
If you compile PCRE with UTF-8 support, but do not use it at run time,
the library will be a bit bigger, but the additional run time overhead
is limited to testing the PCRE_UTF8 flag occasionally, so should not be
very big.
If PCRE is built with Unicode character property support (which implies
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are
supported. The available properties that can be tested are limited to
the general category properties such as Lu for an upper case letter or
Nd for a decimal number, the Unicode script names such as Arabic or
Han, and the derived properties Any and L&. A full list is given in the
pcrepattern documentation. Only the short names for properties are
supported. For example, \p{L} matches a letter. Its Perl synonym,
\p{Letter}, is not supported. Furthermore, in Perl, many properties may
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE
does not support this.
Original comment by Christian Gräfe:
I can see why this change was made, but it is causing problems for me on certain platforms. On some, the extended PCRE syntax "\p{}" doesn't seem to match anything. I couldn't quite figure out what exactly is causing this problem. So far, I have tried the following platforms:
Not working:
Fedora Core 5, PHP 5.1.6, PCRE 6.3
Fedora Core 6, PHP 5.1.6, PCRE 6.6
Working:
Fedora 7, PHP 5.2.2, PCRE 7.0
Solaris 10 x86, PHP 5.1.5, PCRE 6.6
Debian Etch, PHP 4.4.4, PCRE 6.7
Maybe you could amend the docs to state the exact prerequisites for getting Zend_Filter_Alnum et al. to work. That would help me a lot.