
<ac:macro ac:name="unmigrated-inline-wiki-markup"><ac:plain-text-body><![CDATA[

Zend Framework: Zend_Utf8 Component Proposal

Proposed Component Name Zend_Utf8
Developer Notes
Proposers Andrea Ercolino
Zend Liaison TBD
Revision 1.0 - 11 January 2011: Initial Draft. (wiki revision: 37)

Table of Contents

1. Overview

Zend_Utf8 is a simple component that provides escaping and unescaping of UTF-8 strings. It's intended as a replacement for code that already exists in ZF but is embedded in the Zend_Json and Zend_Serializer components. I've recently published a post about it at my site:

The Zend_Utf8 class is really simple, fully coded, and, I hope, ready for delivery. Note that as of the latest release, 1.11.2, the UTF-8 escaping feature in Zend_Json still doesn't take into account all possible UTF-8 characters: it lacks any support for the so-called extended unicode characters, those with a code point between 0x10000 and 0x10FFFF. This class supports all of unicode.
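To make the extended range concrete, here is a minimal sketch of how a code point between 0x10000 and 0x10FFFF is split into a UTF-16 surrogate pair, which is the form a `\uXXXX`-style escape needs. The function name `toSurrogatePair` is hypothetical and used only for illustration; it is not a Zend_Utf8 method.

```php
<?php
// Hypothetical helper, not part of Zend_Utf8: splits an extended code point
// (0x10000..0x10FFFF) into a UTF-16 surrogate pair.
function toSurrogatePair($codePoint)
{
    if ($codePoint < 0x10000 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Not an extended code point');
    }
    $offset = $codePoint - 0x10000;        // 20-bit value
    $high   = 0xD800 | ($offset >> 10);    // top 10 bits
    $low    = 0xDC00 | ($offset & 0x3FF);  // bottom 10 bits
    return array($high, $low);
}

list($high, $low) = toSurrogatePair(0x1D11E); // MUSICAL SYMBOL G CLEF
printf("\\u%04X\\u%04X\n", $high, $low);      // prints \uD834\uDD1E
```

An escaper that only handles code points up to 0xFFFF never reaches this step, which is why characters in the extended range need explicit support.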

Encoding PHP values to another string format, such as JSON, may require escaping UTF-8 characters; likewise, decoding may require unescaping them. I think this sufficiently justifies a class for basic UTF-8 support in the Zend Framework. Once this class is available, the Zend_Json and Zend_Serializer modules should be refactored to call Zend_Utf8 methods where needed.
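As an illustration of what escaping means for JSON, the sketch below uses PHP's built-in `json_encode()` and `json_decode()`. These stand in for the general idea only; they are not the Zend_Json encoder discussed above, and the variable names are arbitrary.

```php
<?php
// PHP's built-in JSON functions, used only to show the escaped form.
$bmp      = "\xC3\xA9";         // U+00E9, 2 bytes in UTF-8
$extended = "\xF0\x9D\x84\x9E"; // U+1D11E, 4 bytes in UTF-8

echo json_encode($bmp), "\n";      // "\u00e9"
echo json_encode($extended), "\n"; // "\ud834\udd1e"

// Decoding unescapes the value and restores the original bytes.
$roundTrip = json_decode(json_encode($extended));
var_dump($roundTrip === $extended); // bool(true)
```

The second value shows an extended character written as two `\u` escapes, one per member of the surrogate pair.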

WordPress Plugin

I've made a plugin that adds full UTF-8 support to WordPress. It's basically a wrapper around the class described here, which I have de-Zend-ified so that it can be distributed in the wild.

2. References

3. Component Requirements, Constraints, and Acceptance Criteria

4. Dependencies on Other Framework Components

5. Theory of Operation

Zend_Utf8 exposes six static functions: two main functions for escaping and unescaping strings, and four ancillary functions for mapping UTF-8 characters to unicode integers and back. The main functions themselves document how the ancillary functions are used, so I'll describe only the usage of the main functions.
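For readers unfamiliar with the mapping the ancillary functions perform, here is a minimal sketch of converting a single UTF-8 character to its unicode integer and back. The names `utf8CharToCodePoint` and `codePointToUtf8Char` are hypothetical (the actual Zend_Utf8 method names are not listed here), and validation is deliberately minimal: it checks only the byte count, not the lead byte or continuation bytes.

```php
<?php
// Hypothetical names; a sketch of the mapping, not the Zend_Utf8 code.

// Map one UTF-8 encoded character to its unicode code point.
function utf8CharToCodePoint($char)
{
    $bytes = array_values(unpack('C*', $char));
    $count = count($bytes);
    if ($count === 1 && $bytes[0] < 0x80) {
        return $bytes[0]; // plain ASCII
    }
    // Payload masks of the lead byte for 2-, 3- and 4-byte sequences.
    $masks = array(2 => 0x1F, 3 => 0x0F, 4 => 0x07);
    if (!isset($masks[$count])) {
        throw new InvalidArgumentException('Invalid UTF-8 character');
    }
    $codePoint = $bytes[0] & $masks[$count];
    for ($i = 1; $i < $count; $i++) {
        // Each continuation byte contributes its low 6 bits.
        $codePoint = ($codePoint << 6) | ($bytes[$i] & 0x3F);
    }
    return $codePoint;
}

// Map a unicode code point back to its UTF-8 encoded character.
function codePointToUtf8Char($codePoint)
{
    if ($codePoint < 0x80) {
        return chr($codePoint);
    }
    if ($codePoint < 0x800) {
        return chr(0xC0 | ($codePoint >> 6))
             . chr(0x80 | ($codePoint & 0x3F));
    }
    if ($codePoint < 0x10000) {
        return chr(0xE0 | ($codePoint >> 12))
             . chr(0x80 | (($codePoint >> 6) & 0x3F))
             . chr(0x80 | ($codePoint & 0x3F));
    }
    return chr(0xF0 | ($codePoint >> 18))
         . chr(0x80 | (($codePoint >> 12) & 0x3F))
         . chr(0x80 | (($codePoint >> 6) & 0x3F))
         . chr(0x80 | ($codePoint & 0x3F));
}

printf("0x%X\n", utf8CharToCodePoint("\xE2\x82\xAC")); // prints 0x20AC
```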

In the use cases I'm going to use the following functions:


Options change the default behavior and must be provided as an associative array.

| key | type | description |
| extendedUseSurrogate | boolean | Controls how an extended unicode character is represented. TRUE: as a surrogate pair; write/read handlers are called twice, each time receiving one member of the pair. FALSE: as a code point; write/read handlers are called once, receiving the code point. |
| write | handler | Controls how a code point (or each member of a surrogate pair) is written to the escaped output. |
| read | handler | Controls how a code point (or each member of a surrogate pair) is read from the escaped input. |
| filters | array | before-write: handler; after-read: handler |

A handler in the above table is an array with two keys: callback and arguments. Only the read handler has an additional key: a preg pattern that is matched against the escaped string to extract the text of each escaped value.

| handler | callback input | callback output |
| write | arguments + unicode integer (code point or member of surrogate pair) + (unescaped) UTF-8 character | string of the escaped UTF-8 character |
| read | arguments + pattern matches | unicode integer (code point or member of surrogate pair) |
| before-write | arguments + unescaped string | unescape-safe string |
| after-read | arguments + unescape-safe string | unescaped string |
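The following sketch shows how an options array and its handlers could fit together. It is an illustration under assumptions, not the Zend_Utf8 implementation: `sketchEscape`, `codePointOf`, `writeHex` and `readHex` are hypothetical names, and the `pattern` key name is assumed because the tables above do not name it. The `filters` option is left out for brevity.

```php
<?php
// Write callback: extra arguments first, then the unicode integer, then the
// unescaped UTF-8 character. Returns the escaped form.
function writeHex($prefix, $unicode, $char)
{
    return sprintf('%s%04x', $prefix, $unicode);
}

// Read callback: extra arguments (none here), then the preg matches.
// Returns the unicode integer.
function readHex(array $matches)
{
    return hexdec($matches[1]);
}

$options = array(
    'extendedUseSurrogate' => true,
    'write' => array('callback' => 'writeHex', 'arguments' => array('\\u')),
    'read'  => array(
        'callback'  => 'readHex',
        'arguments' => array(),
        'pattern'   => '/\\\\u([0-9a-f]{4})/i', // key name is an assumption
    ),
);

// Decode one valid UTF-8 character to its code point (no validation).
function codePointOf($char)
{
    $bytes = array_values(unpack('C*', $char));
    $masks = array(1 => 0x7F, 2 => 0x1F, 3 => 0x0F, 4 => 0x07);
    $codePoint = $bytes[0] & $masks[count($bytes)];
    for ($i = 1; $i < count($bytes); $i++) {
        $codePoint = ($codePoint << 6) | ($bytes[$i] & 0x3F);
    }
    return $codePoint;
}

// Escape every non-ASCII character through the write handler.
function sketchEscape($string, array $options)
{
    $write = $options['write'];
    $out   = '';
    // The /u modifier splits on character boundaries; input must be valid UTF-8.
    foreach (preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        if (strlen($char) === 1) { // ASCII passes through unchanged
            $out .= $char;
            continue;
        }
        $unicode = codePointOf($char);
        $units   = array($unicode);
        if ($unicode > 0xFFFF && $options['extendedUseSurrogate']) {
            $offset = $unicode - 0x10000;
            $units  = array(0xD800 | ($offset >> 10), 0xDC00 | ($offset & 0x3FF));
        }
        foreach ($units as $unit) { // runs twice for a surrogate pair
            $args = array_merge($write['arguments'], array($unit, $char));
            $out .= call_user_func_array($write['callback'], $args);
        }
    }
    return $out;
}

$escaped = sketchEscape("G clef: \xF0\x9D\x84\x9E", $options);
echo $escaped, "\n"; // G clef: \ud834\udd1e

// Read the first escaped value back through the read handler.
$read = $options['read'];
preg_match($read['pattern'], $escaped, $matches);
$args = array_merge($read['arguments'], array($matches));
printf("0x%X\n", call_user_func_array($read['callback'], $args)); // 0xD834
```

With `extendedUseSurrogate` set to FALSE the write callback would instead be called once with 0x1D11E, so a handler meant for that mode would need an output format that can hold more than four hex digits.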

6. Milestones / Tasks

  • Milestone 1: [DONE] Working prototype
  • Milestone 2: Unit tests exist, work, and are checked into SVN.
  • Milestone 3: Initial documentation exists.

7. Class Index

  • Zend_Utf8_Exception
  • Zend_Utf8

8. Use Cases

9. Class Skeletons

Actually, these are class implementations.

