ZF-11223: Zend_Filter_StringTrim breaks UTF-8 string if last character is uppercased Cyrillic 'Р'

Issue Type: Bug Created: 2011-03-24T17:21:39.000+0000 Last Updated: 2011-09-25T09:06:32.000+0000 Status: Resolved Fix version(s): - Next Major Release ()

Reporter: Edward Surov (zooh) Assignee: Thomas Weidner (thomas) Tags: - Zend_Filter

Related issues: - ZF-10891



The filter tears away last byte of two-byte Cyrillic symbol 'Р' (U+0420, 0xD0|0xA0 sequence in UTF-8) when it's placed in the end of UTF-8 string (for example, in abbreviation 'ЮАР'), thus making the filtered UTF-8 string invalid.

The last byte (0xA0) corresponds to non-breaking space (U+00A0) in Latin-1, that's the reason why it gets trimmed. I think that trying to perform UTF-8 trimming without complete decoding UTF-8 to Unicode sequence will fail again and again in similar cases.


Posted by Edward Surov (zooh) on 2011-03-24T20:15:19.000+0000

I've worked out a solution that works. It's rather a hack than beautiful piece of code, but at least it works.

First, we check for empty string to avoid all the processing if there's nothing to trim from. Second, we add 'u' modifier to the pattern. Now, we're starting to get error in preg_replace sometimes (on Cyrillic string 'Ром', for example). If we face such error, we delegate the task of trimming to a slower procedure.

<pre class="highlight">
    protected function _unicodeTrim($value, $charlist = '\\\\s')
        if ('' == $value) {
            return $value;

        $chars = preg_replace(
            array( '/[\^\-\]\\\]/S', '/\\\{4}/S', '/\//'),
            array( '\\\\\\0', '\\', '\/' ),

        $pattern = '^[' . $chars . ']*|[' . $chars . ']*$';
        $result = preg_replace("/$pattern/usSD", '', $value);

        if (null === $result) {
            $result = $this->_slowUnicodeTrim($value, $chars);

        return $result;

In slower method we split the UTF-8 string into an array of single UTF-8 characters (using a simple hack with iconv() - UTF-32 characters are always 4 bytes long). Then we match separate characters against simple pattern that won't generate an internal error and remove them if necessary. In the end we just join the array again and return it.

<pre class="highlight">
    protected function _slowUnicodeTrim($value, $chars) {
        $utfChars = $this->_splitUtf8($value);
        $pattern = '/^[' . $chars . ']$/usSD';

        while ($utfChars && preg_match($pattern, $utfChars[0])) {

        while ($utfChars && preg_match($pattern, $utfChars[count($utfChars) - 1])) {

        return implode('', $utfChars);

    protected function _splitUtf8($value)
        $utfChars = str_split(iconv('UTF-8', 'UTF-32BE', $value), 4);
        array_walk($utfChars, create_function('&$char', '$char = iconv("UTF-32BE", "UTF-8", $char);'));
        return $utfChars;

Posted by Thomas Weidner (thomas) on 2011-07-24T20:52:20.000+0000

This issue duplicates ZF-10891

Posted by Thomas Weidner (thomas) on 2011-07-24T20:52:46.000+0000

This issue has been fixed with GH-107

Posted by Ramon Henrique Ornelas (ramon) on 2011-07-24T23:10:17.000+0000

@thomas some problem in apply this issue to next mini release (merges to branch 1.11) ;)?

Greetings Ramon

Posted by Thomas Weidner (thomas) on 2011-07-25T19:53:42.000+0000

Feel free to backport the fix from ZF2 to ZF1.

Have you found an issue?

See the Overview section for more details.


© 2006-2016 by Zend, a Rogue Wave Company. Made with by awesome contributors.

This website is built using zend-expressive and it runs on PHP 7.