
<ac:macro ac:name="unmigrated-inline-wiki-markup"><ac:plain-text-body><![CDATA[

<ac:macro ac:name="unmigrated-inline-wiki-markup"><ac:plain-text-body><![CDATA[

Zend Framework: Zend_Utf8 Component Proposal

Proposed Component Name Zend_Utf8
Developer Notes http://framework.zend.com/wiki/display/ZFDEV/Zend_Utf8
Proposers Andrea Ercolino
Zend Liaison TBD
Revision 1.0 - 11 January 2011: Initial Draft. (wiki revision: 40)

1. Overview

Zend_Utf8 is a simple component that escapes and unescapes UTF-8 strings. It is intended to replace code that already exists in ZF but is embedded in the Zend_Json and Zend_Serializer components. I recently published a post about it on my site: http://noteslog.com/post/escaping-and-unescaping-utf-8-characters-in-php/

The Zend_Utf8 class is really simple, fully coded, and, I hope, ready for delivery. Note that even in the latest release, 1.11.2, the UTF-8 escaping feature in Zend_Json does not handle all possible UTF-8 characters: it lacks any support for the so-called extended Unicode characters, those with a code point between 0x10000 and 0x10FFFF. This class supports all of Unicode.
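To make the extended range concrete, here is a minimal sketch of how a code point above 0xFFFF can be split into a UTF-16 surrogate pair and written in the \uXXXX style used by JSON. The function name sketch_escape_code_point is hypothetical and is not part of Zend_Utf8 or Zend_Json; it only illustrates the arithmetic.

```php
<?php
// Hypothetical helper (not part of Zend_Utf8): format a code point as a
// JSON-style escape, using a surrogate pair for 0x10000..0x10FFFF.
function sketch_escape_code_point($codePoint)
{
    if ($codePoint < 0x10000) {
        // Basic Multilingual Plane: a single \uXXXX escape is enough.
        return sprintf('\u%04x', $codePoint);
    }
    $offset = $codePoint - 0x10000;       // 20-bit value
    $high   = 0xD800 | ($offset >> 10);   // top 10 bits
    $low    = 0xDC00 | ($offset & 0x3FF); // bottom 10 bits
    return sprintf('\u%04x\u%04x', $high, $low);
}

// U+1D11E MUSICAL SYMBOL G CLEF lies in the extended range.
echo sketch_escape_code_point(0x1D11E), "\n";
```

A code point below 0x10000 yields one escape; anything above it yields two, which is the case the text says Zend_Json 1.11.2 does not cover.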

Encoding PHP values to another string format, such as JSON, may require escaping UTF-8 characters; likewise, decoding may require unescaping them. I think this sufficiently justifies a class for basic UTF-8 support in the Zend Framework. Once this class is available, the Zend_Json and Zend_Serializer modules should be refactored to call Zend_Utf8 methods where needed.

WordPress Plugin

I've made a plugin that adds full UTF-8 support to WordPress. It's basically a wrapper around the class described here, which I have de-Zend-ified for distribution in the wild.

2. References

3. Component Requirements, Constraints, and Acceptance Criteria

4. Dependencies on Other Framework Components

5. Theory of Operation

Zend_Utf8 exposes six static functions: two main functions for escaping and unescaping strings, and four ancillary functions for mapping UTF-8 characters to Unicode integers and back. The main functions show how the ancillary ones are used, so I'll describe only the usage of the main functions.
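The mapping between UTF-8 characters and Unicode integers can be sketched as follows. This is an illustration only: the function names are hypothetical (the proposal does not list the actual Zend_Utf8 method names here), and the decoder assumes a single well-formed character and does no validation.

```php
<?php
// Hypothetical names; not the actual Zend_Utf8 methods.

// Encode a Unicode code point as a UTF-8 byte string.
function sketch_code_point_to_utf8($cp)
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Decode one well-formed UTF-8 character into its code point (no validation).
function sketch_utf8_to_code_point($char)
{
    $lead = ord($char[0]);
    if ($lead < 0x80) {
        return $lead;
    }
    // The lead byte gives the sequence length and the first payload bits.
    if (($lead & 0xE0) === 0xC0) {
        $len = 2;
        $cp  = $lead & 0x1F;
    } elseif (($lead & 0xF0) === 0xE0) {
        $len = 3;
        $cp  = $lead & 0x0F;
    } else {
        $len = 4;
        $cp  = $lead & 0x07;
    }
    // Each continuation byte contributes its low 6 bits.
    for ($i = 1; $i < $len; $i++) {
        $cp = ($cp << 6) | (ord($char[$i]) & 0x3F);
    }
    return $cp;
}
```

The two functions are inverses of each other for valid input, so sketch_utf8_to_code_point(sketch_code_point_to_utf8($cp)) returns $cp for any code point up to 0x10FFFF.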

In the use cases I'm going to use the following functions:

Options

Options change the default behavior and must be provided as an associative array.

key | type | description
extendedUseSurrogate | boolean | Controls how an extended Unicode character is represented. TRUE: a surrogate pair; the write/read handlers are called twice, each time receiving one member of the pair. FALSE: a code point; the write/read handlers are called once, receiving the code point.
write | handler | Controls how a code point (or each member of a surrogate pair) is written to the escaped output.
read | handler | Controls how a code point (or each member of a surrogate pair) is read from the escaped input.
filters | array | before-write: handler; after-read: handler

A handler in the table above is an array with two keys: callback and arguments. The read handler alone has an additional key: a preg pattern that is matched against the escaped string to extract the text of each escaped value.

handler | callback input | callback output
write | arguments + Unicode integer (code point or member of surrogate pair) + (unescaped) UTF-8 character | string of the escaped UTF-8 character
read | arguments + pattern matches | Unicode integer (code point or member of surrogate pair)
before-write | arguments + unescaped string | unescape-safe string
after-read | arguments + unescape-safe string | unescaped string
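As an illustration of the handler shape described above, the following sketch builds an options array that writes code points as HTML hexadecimal entities and reads them back. It is a sketch under assumptions: the callback names are hypothetical, the key name 'pattern' for the read handler's preg pattern is assumed (the text does not name it), the filters are omitted, and the handlers are invoked by hand rather than through Zend_Utf8.

```php
<?php
// Hypothetical write callback: receives the handler's arguments first,
// then the Unicode integer and the unescaped UTF-8 character.
function sketch_write_entity($prefix, $codePoint, $char)
{
    return $prefix . strtoupper(dechex($codePoint)) . ';';
}

// Hypothetical read callback: receives the handler's arguments (none here),
// then the preg matches, and returns the Unicode integer.
function sketch_read_entity($matches)
{
    return hexdec($matches[1]);
}

$options = array(
    'extendedUseSurrogate' => false, // handlers get one code point, not a pair
    'write' => array(
        'callback'  => 'sketch_write_entity',
        'arguments' => array('&#x'),
    ),
    'read' => array(
        'pattern'   => '/&#x([0-9A-Fa-f]+);/', // key name 'pattern' is assumed
        'callback'  => 'sketch_read_entity',
        'arguments' => array(),
    ),
);

// Invoke the write handler by hand, following the input/output table.
$write   = $options['write'];
$escaped = call_user_func_array(
    $write['callback'],
    array_merge($write['arguments'], array(0x1D11E, "\xF0\x9D\x84\x9E"))
);

// Invoke the read handler on the escaped text.
$read = $options['read'];
preg_match($read['pattern'], $escaped, $matches);
$codePoint = call_user_func_array(
    $read['callback'],
    array_merge($read['arguments'], array($matches))
);
```

With extendedUseSurrogate set to TRUE, the same write callback would instead be called twice for this character, once per member of the surrogate pair.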

6. Milestones / Tasks

  • Milestone 1: [DONE] Working prototype
  • Milestone 2: Unit tests exist, work, and are checked into SVN.
  • Milestone 3: Initial documentation exists.

7. Class Index

  • Zend_Utf8_Exception
  • Zend_Utf8

8. Use Cases

9. Class Skeletons

Actually, these are class implementations.

]]></ac:plain-text-body></ac:macro>

]]></ac:plain-text-body></ac:macro>

Labels: json, utf8, unicode
  1. Jan 11, 2011

    <p>I do not see any advantage over PHP's multibyte string functions - <a class="external-link" href="http://php.net/manual/de/book.mbstring.php">http://php.net/manual/de/book.mbstring.php</a></p>

    <p>Can you tell me what the advantages of Zend_Utf8 are over PHP's multibyte string functions?</p>

    1. Jan 11, 2011

      <p>If you look at the referenced files, you'll see that a weaker version of this code is already deployed in ZF.</p>

      <p>As for the advantages, this code uses mb_convert_encoding if available, and if not, it gives the expected result anyway.</p>

      <p>As for the rationale, this class's purpose is escaping and unescaping using user-configurable formats; the default format is the same one used by json_encode / json_decode.</p>

  2. Jan 19, 2011

    <p>Great job, Andrea! I was about to suggest a similar class called Zend_Unicode or Zend_Encoding with support for the following:</p>
    <ol>
    <li>Full support for manipulating characters and strings in the 7 standard Unicode Character Encoding Schemes
    <ol>
    <li>UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE
    <ol>
    <li>Character to code point to HTML entity to hex to decimal to binary, back and forth, etc.</li>
    <li>Parse and split on word boundaries with ease.</li>
    <li>Filter out erroneous characters such as unwanted control codes.</li>
    <li>Bit representation and bit packing schemes, Endianness, etc.</li>
    </ol>
    </li>
    </ol>
    </li>
    <li>Normalization
    <ol>
    <li>Unicode Normalization Forms (NFC, NFD, NFKC, and NFKD).
    <ol>
    <li>Most importantly, convert combining characters into their single code point form for ease of regex and string processing in data feeds and XML. </li>
    </ol>
    </li>
    </ol>
    </li>
    <li>Queries against the Unicode Character Database (UCD) by character, decimal, hex, binary, code point, etc. to allow full character map tool functionality
    <ol>
    <li>Full UCD functionality with the data included in self-contained arrays or text files for lookups and queries.</li>
    <li><a href="http://www.unicode.org/Public/UNIDATA/UnicodeData.txt">http://www.unicode.org/Public/UNIDATA/UnicodeData.txt</a></li>
    </ol>
    </li>
    <li>Render ASCII and Unicode (UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE) tables
    <ol>
    <li>Show overlays (compatibility or incompatibility) for a given character or code point across encodings.</li>
    </ol>
    </li>
    <li>Data Scrubbing and advanced string processing for use with data and data feeds
    <ol>
    <li>Custom regex routines for the most common data scrubbing and parsing tasks.</li>
    <li>Normalization routines to check for equivalency without losing the original format.</li>
    </ol>
    </li>
    <li>Collation and Sorting
    <ol>
    <li>Allow string output via a variety of collation algorithms applied to the same input string.</li>
    </ol>
    </li>
    <li>MySQL latin1 to utf8 charset conversion routines.
    <ol>
    <li>Helper methods to facilitate migrating from latin1 to utf8 in MySQL databases, tables, and fields.</li>
    </ol>
    </li>
    <li>Character Set Analysis
    <ol>
    <li>Methods that take a UTF-8 encoded string and determine the most efficient encoding for storage in a field in MySQL or any other database to optimize the data footprint, etc.</li>
    </ol>
    </li>
    </ol>