<ac:macro ac:name="unmigrated-inline-wiki-markup"><ac:plain-text-body><![CDATA[
<ac:macro ac:name="unmigrated-inline-wiki-markup"><ac:plain-text-body><![CDATA[
Zend_Filter_Transliteration is a component that transliterates UTF-8 strings into ASCII.
Zend Framework: Zend_Filter_Transliteration Component Proposal
Proposed Component Name
Zend_Filter_Transliteration
Developer Notes
http://framework.zend.com/wiki/display/ZFDEV/Zend_Filter_Transliteration
Proposers
Martin Hujer
Alexander Veremyev (Zend Liaison)
Revision
1.0 - 29th January 2008: Initial proposal
1.1 - 8th April 2008: Proposal update (wiki revision: 8)
Table of Contents
1. Overview
2. References
3. Component Requirements, Constraints, and Acceptance Criteria
4. Dependencies on Other Framework Components
- Zend_Filter_Interface
5. Theory of Operation
...
6. Milestones / Tasks
- Milestone 1: [DONE] Finalize this proposal
- Milestone 2: [DONE] Working prototype checked into the http://zfdev.googlecode.com/svn/trunk/ZendFilterTransliteration/
- Milestone 3: [DONE] Unit tests exist, work, and are checked into SVN.
- Milestone 4: Community and Zend review
7. Class Index
Zend_Filter_Transliteration
8. Use Cases
19 Comments
Jan 30, 2008
Matthew Weier O'Phinney
<p>I'm worried that some users will confuse Transliteration with Inflector, and use the wrong filter. Perhaps the class name should be something along the lines of Zend_Filter_ReplaceI18nChars or something similar that lay-people will understand?</p>
Jan 30, 2008
Martin Hujer
<p>OK, I'll rename it. But later, I don't want to go through another class renaming dance <ac:emoticon ac:name="smile" /></p>
Jan 30, 2008
Tomas Markauskas
<p>I'm not sure if it's a good idea. Transliteration is the exact description of what this filter will do, so in my opinion it should be called that.</p>
Jan 30, 2008
Martin Hujer
<p>OK, the name 'Transliteration' was suggested to me by somebody, and because I'm not a <em>native speaker</em>, I wasn't sure. I have now looked it up in the dictionary and I'm sure it's a good name.</p>
Feb 06, 2008
Marc Jakubowski
<p>"Transliteration" would imply that it is possible to transliterate into any charset and not just ASCII. But that's out of its scope, right?<br />
So what about "Zend_Filter_Asciify", that should be clear <ac:emoticon ac:name="smile" /></p>
Feb 18, 2008
Martin Hujer
<p>Just added Cyrillic transliteration using a transliteration table.</p>
<p>Please update from SVN.</p>
<p>Martin H.</p>
Feb 19, 2008
Thomas Weidner
<p>For this filter, I would suggest that it be able to recode:</p>
<ul>
<li>ASCII to UTF-8</li>
<li>UTF-8 to ASCII</li>
</ul>
<p>ASCII to UTF-8 is problematic...<br />
And if I use your UTF-8 to ASCII recoding, I would have problems with characters other than Cyrillic.</p>
<p>If you want to create a generic filter (which is what the name suggests), you should also support ALL UTF-8 characters.</p>
<p>Until then it would be better to name it something like "Zend_Filter_CyrillicToAscii", and so on.<br />
I am not sure you can support all charsets... how would you recode Chinese or Japanese characters into ASCII?</p>
<p>I am not even sure it's a good idea to downgrade characters, because the meaning can change.<br />
In German, for example, "ß" and "ss" are sometimes not interchangeable.</p>
<p>Thomas<br />
I18N Team Leader</p>
Feb 19, 2008
Martin Hujer
<p>Hi,</p>
<p>I originally wanted to create a simple filter that converts a string (e.g. an article title) into an SEO-friendly URL.<br />
I needed to convert special Czech characters such as ěščřžýáíéůú into their ASCII equivalents.</p>
<p>It was originally Zend_Filter_SeoUrl and then <ac:link><ri:page ri:content-title="Zend_Filter_Sanitize - Martin Hujer" /></ac:link>.</p>
<p>Finally, I split the transliteration part out into this proposal. Somebody on a Czech board suggested Cyrillic support, so it is included.</p>
<p>I really don't know how to solve this. I think a component for SEO URLs would be useful, but creating the whole UTF-8 -> ASCII transliteration table is really hard.<br />
Maybe I should split out the CyrillicToAscii part...</p>
<p>If somebody has an idea how to solve this, I'll be happy.</p>
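<p>As a rough illustration of the SEO-URL use case described above, the following is a minimal sketch that maps Czech letters to ASCII with <code>strtr()</code> and then collapses everything else into hyphens. The function name <code>seoSlug</code> and the table are assumptions made for this example; they are not taken from the proposal's code.</p>

```php
<?php
// Hypothetical helper, not part of the proposal: transliterates Czech
// letters to ASCII and turns the result into a URL slug.
function seoSlug(string $title): string
{
    // Illustrative table covering Czech only; other languages need more entries.
    $table = [
        'á' => 'a', 'č' => 'c', 'ď' => 'd', 'é' => 'e', 'ě' => 'e',
        'í' => 'i', 'ň' => 'n', 'ó' => 'o', 'ř' => 'r', 'š' => 's',
        'ť' => 't', 'ú' => 'u', 'ů' => 'u', 'ý' => 'y', 'ž' => 'z',
        'Á' => 'A', 'Č' => 'C', 'Ď' => 'D', 'É' => 'E', 'Ě' => 'E',
        'Í' => 'I', 'Ň' => 'N', 'Ó' => 'O', 'Ř' => 'R', 'Š' => 'S',
        'Ť' => 'T', 'Ú' => 'U', 'Ů' => 'U', 'Ý' => 'Y', 'Ž' => 'Z',
    ];

    // strtr() with an array replaces whole keys, so multi-byte UTF-8
    // sequences are swapped as units.
    $ascii = strtolower(strtr($title, $table));

    // Every run of characters outside a-z and 0-9 (including bytes of
    // unmapped UTF-8 characters) becomes a single hyphen.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $ascii);

    return trim($slug, '-');
}

echo seoSlug('Příliš žluťoučký kůň'), "\n"; // prints "prilis-zlutoucky-kun"
```

<p>Characters missing from the table end up as hyphens rather than being transliterated, which is the coverage problem discussed in this thread.</p>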
Jul 22, 2008
Pawel Przeradowski
<p>Please consider including a transliteration table for your northern neighbours <ac:emoticon ac:name="wink" /></p>
<ac:macro ac:name="code"><ac:default-parameter>php</ac:default-parameter><ac:plain-text-body><![CDATA[
private function _transliteratePolish ($s)
]]></ac:plain-text-body></ac:macro>
Jul 22, 2008
Martin Hujer
<p>Thank you. I've updated the code and tests. <ac:emoticon ac:name="smile" /></p>
Jul 22, 2008
Pawel Przeradowski
<p>You are so damn quick.</p>
<p>One more thing I can contribute. These are the traditional, well-established transliteration rules for Danish. (I studied Danish philology once.)</p>
<ac:macro ac:name="code"><ac:default-parameter>php</ac:default-parameter><ac:plain-text-body><![CDATA[
private function _transliterateDanish ($s)
]]></ac:plain-text-body></ac:macro>
Jul 22, 2008
Martin Hujer
<p>Added. Thanks <ac:emoticon ac:name="smile" /></p>
Jul 24, 2008
Goran Juric
<p>Hi,</p>
<p>Here are the transliteration rules for Croatian. The same rules are used for Serbian (Latin) and Slovenian (although the Slovenian language doesn't have the character '?' ('?')), and maybe for some other languages as well.</p>
<p>Some of the characters already exist in other alphabets. </p>
<ac:macro ac:name="code"><ac:parameter ac:name="type">php</ac:parameter><ac:plain-text-body><![CDATA[
/**
*
*/
private function _transliterateCroatian ($s)
public function testCroatian ()
]]></ac:plain-text-body></ac:macro>
Jul 24, 2008
Martin Hujer
<p>Hi,</p>
<p>Can you please mail it to mhujer <ac:link><ri:page ri:content-title="at" /></ac:link> gmail <ac:link><ri:page ri:content-title="dot" /></ac:link> com as an attachment? Some special characters are displayed here as question marks.</p>
<p>Thanks! <ac:emoticon ac:name="smile" /></p>
Jul 28, 2008
Martin Hujer
<p>Thanks. I have added it.</p>
Jul 28, 2008
Goran Juric
<p>There is an interesting post on <a class="external-link" href="http://www.sitepoint.com/blogs/2006/03/03/us-ascii-transliterations-of-unicode-text/">http://www.sitepoint.com/blogs/2006/03/03/us-ascii-transliterations-of-unicode-text/</a> on the subject of transliteration.</p>
<p>My main reason for using this ZF component would be to make sure that article titles are transliterated into URIs, so that I get valid and readable URIs.</p>
<p>Did you have a look at the <a class="external-link" href="http://derickrethans.nl/translit.php">http://derickrethans.nl/translit.php</a> library mentioned in the article's comments?</p>
Jul 29, 2008
Pádraic Brady
<p>There is a recent (well, a few years recent <ac:emoticon ac:name="wink" />) library for transliteration called UTF8 To ASCII (utf8_to_ascii on Google). It's a simple concept which parses an incoming UTF-8 string, breaks it into UTF-8 characters in hex form, and grabs a sensible hand-picked transliteration from a set of static transliteration tables (just a bunch of PHP arrays). The translit tables are common data (I think Perl has the same set) and the remaining source code, while complex, isn't long. I think it's a really great approach to research, and the existing data will aid your collection.</p>
<p>I do think Thomas' suggestion is probably not a good one - you should focus only on UTF-8 to ASCII. The reverse is almost impossible unless the UTF-8 to ASCII follows a very specific algorithm (the most common being implemented for IDNs); otherwise how could you even tokenise it sensibly? The point after all is to create sensible, human-readable, and agreeable ASCII alternatives for UTF-8 characters for presentation in a URI. If you pardon me for summarising... <ac:emoticon ac:name="wink" /></p>
<p>As for translit/iconv etc. - mixing extensions would lead to significant divergence. No two transliteration strategies are identical, which is why I strongly prefer a native solution that is cross-compatible anywhere without needing any specific optional extension, unless it's a really common one.</p>
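<p>The table-driven approach described above can be sketched in a few lines of plain PHP. This is only an illustration under stated assumptions: the function names, the <code>?</code> fallback and the tiny table are invented for this example, and the real utf8_to_ascii library ships far larger tables and its own decoding code.</p>

```php
<?php
// Hypothetical, deliberately tiny table; a real data set would cover
// the whole Unicode range.
function translitTable(): array
{
    return [
        'ß' => 'ss',
        'Ł' => 'L', 'ó' => 'o', 'ź' => 'z',
        'Ж' => 'Zh', 'ж' => 'zh', 'у' => 'u', 'к' => 'k',
    ];
}

// Walks the string one UTF-8 character at a time and looks each
// non-ASCII character up in the static table.
function utf8ToAsciiSketch(string $text, string $unknown = '?'): string
{
    $table = translitTable();

    // The /u modifier makes PCRE treat the subject as UTF-8, so each
    // element of the result is one complete character.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return ''; // input was not valid UTF-8
    }

    $out = '';
    foreach ($chars as $char) {
        if (strlen($char) === 1) {
            $out .= $char;                     // single byte: already ASCII
        } else {
            $out .= $table[$char] ?? $unknown; // table hit or fallback
        }
    }
    return $out;
}

echo utf8ToAsciiSketch('Łódź'), "\n";   // prints "Lodz"
echo utf8ToAsciiSketch('Straße'), "\n"; // prints "Strasse"
```

<p>Because this relies only on PCRE and arrays, it needs no optional extension, which is the portability argument made in the comment above.</p>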
Aug 05, 2008
Alexander Veremyev
<ac:macro ac:name="note"><ac:parameter ac:name="title">Zend Comments</ac:parameter><ac:rich-text-body>
<p>It would be really useful to have such functionality in ZF.<br />
The API is also clear.</p>
<p>The problem is the implementation, which doesn't cover all existing languages. It's not good to provide a component that works with some inputs and doesn't work with others.</p>
<p>Covering full Unicode range:</p>
<ul>
<li>needs complete transliteration info;</li>
<li>may cause performance problems;</li>
</ul>
<p>On the other hand, the <a href="http://pecl.php.net/package/intl">intl PECL extension</a> is based on the <a href="http://www.icu-project.org/">ICU library</a>, which has very comprehensive transliteration functionality.<br />
Hopefully this functionality will be included in intl.</p>
<p>So the proposal is moved to the archived section until we have coverage of the full (or at least the essential) Unicode range and performance data for a pure PHP implementation,<br />
or until an intl extension with transliteration functionality is included in PHP.</p></ac:rich-text-body></ac:macro>
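<p>For reference, later releases of the intl extension do expose ICU transliteration through the <code>Transliterator</code> class. The sketch below assumes that extension is loaded and returns <code>null</code> when it is not; the wrapper name <code>icuToAscii</code> is hypothetical and the rule string is one possible choice, not something specified by this proposal.</p>

```php
<?php
// Hypothetical wrapper around ICU transliteration; requires the intl
// extension and returns null if it is missing or the call fails.
function icuToAscii(string $text): ?string
{
    if (!class_exists('Transliterator')) {
        return null; // intl extension not available
    }

    // "Any-Latin" romanizes non-Latin scripts, and "Latin-ASCII" then
    // strips diacritics from the Latin result.
    $transliterator = Transliterator::create('Any-Latin; Latin-ASCII');
    if ($transliterator === null) {
        return null;
    }

    $result = $transliterator->transliterate($text);
    return $result === false ? null : $result;
}

var_dump(icuToAscii('Příliš žluťoučký kůň'));
```

<p>The output depends on the ICU version bundled with PHP, which is the kind of divergence between transliteration strategies raised earlier in the thread.</p>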
Aug 05, 2008
Thomas Weidner
<p>I just integrated CLDR 1.6.1 for the I18N core in ZF.<br />
I saw that it includes character downgrading information.</p>
<p>I will have to review it, but maybe Zend_Locale will be able to provide the needed information soon.</p>