
<ac:macro ac:name="unmigrated-inline-wiki-markup"><ac:plain-text-body><![CDATA[

<ac:macro ac:name="unmigrated-inline-wiki-markup"><ac:plain-text-body><![CDATA[

Zend Framework: Zend_Filter_Transliteration Component Proposal

Proposed Component Name Zend_Filter_Transliteration
Developer Notes http://framework.zend.com/wiki/display/ZFDEV/Zend_Filter_Transliteration
Proposers Martin Hujer
Alexander Veremyev (Zend Liaison)
Revision 1.0 - 29th January 2008: Initial proposal
1.1 - 8th April 2008: Proposal update (wiki revision: 8)

Table of Contents

1. Overview

Zend_Filter_Transliteration is a component that transliterates UTF-8 strings into ASCII.
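
A minimal sketch of the idea, not the proposal's actual implementation. The class name Sketch_Filter_Transliteration, the stand-in interface Sketch_Filter_Interface (used in place of Zend_Filter_Interface so the example runs without the framework) and the tiny lookup table are all assumptions made for illustration.

```php
<?php
// Stand-in for Zend_Filter_Interface so the sketch runs without the framework.
interface Sketch_Filter_Interface
{
    public function filter($value);
}

// Hypothetical class name; the real component may be structured differently.
class Sketch_Filter_Transliteration implements Sketch_Filter_Interface
{
    // Illustrative table only; a usable one would cover far more characters.
    protected $_table = array(
        'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u',
        'ý' => 'y', 'š' => 's', 'ž' => 'z',
    );

    public function filter($value)
    {
        // strtr() with an array replaces byte sequences, so multi-byte
        // UTF-8 keys work without the mbstring extension.
        return strtr((string) $value, $this->_table);
    }
}

$filter = new Sketch_Filter_Transliteration();
echo $filter->filter('šílený'), "\n"; // prints "sileny"
```

Characters missing from the table pass through unchanged in this sketch, which is exactly the coverage problem discussed in the comments below.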

2. References

3. Component Requirements, Constraints, and Acceptance Criteria

  • This filter must correctly transliterate any UTF-8 string into its ASCII equivalent

4. Dependencies on Other Framework Components

  • Zend_Filter_Interface

5. Theory of Operation

...

6. Milestones / Tasks

7. Class Index

Zend_Filter_Transliteration

8. Use Cases

9. Class Skeletons

]]></ac:plain-text-body></ac:macro>

]]></ac:plain-text-body></ac:macro>

  1. Jan 30, 2008

    <p>I'm worried that some users will confuse Transliteration with Inflector, and use the wrong filter. Perhaps the class name should be something along the lines of Zend_Filter_ReplaceI18nChars or something similar that lay-people will understand?</p>

    1. Jan 30, 2008

      <p>OK, I'll rename it. But later, I don't want to go through another class renaming dance <ac:emoticon ac:name="smile" /></p>

    2. Jan 30, 2008

      <p>I'm not sure if it's a good idea. Transliteration is the exact description of what this filter will do, so in my opinion it should be called that.</p>

    3. Jan 30, 2008

      <p>OK, the name 'Transliteration' was suggested to me by somebody, and because I'm not a <em>native speaker</em>, I wasn't sure. Now that I have looked it up in the dictionary, I'm sure it's a good name.</p>

    4. Feb 06, 2008

      <p>"Transliteration" would imply that it is possible to transliterate into any charset and not just ASCII. But that's out of its scope, right?<br />
      So what about "Zend_Filter_Asciify"? That should be clear <ac:emoticon ac:name="smile" /></p>

  2. Feb 18, 2008

    <p>Just added Cyrillic transliteration using a transliteration table.</p>

    <p>Please update from SVN.</p>

    <p>Martin H.</p>

  3. Feb 19, 2008

    <p>For this filter, I would suggest that it should be able to recode:</p>
    <ul>
    <li>ASCII to UTF-8</li>
    <li>UTF-8 to ASCII</li>
    </ul>

    <p>ASCII to UTF-8 is problematic...<br />
    And if I use your UTF-8 to ASCII recoding, I would have problems with characters other than Cyrillic.</p>

    <p>If you want to create a generic filter (which is what the name suggests), you should also support ALL UTF-8 characters.</p>

    <p>Until then it would be better to name it something like "Zend_Filter_CyrillicToAscii" and so on.<br />
    I am not sure if you are able to support all charsets... how would you recode Chinese or Japanese characters into ASCII?</p>

    <p>I am not even sure if it's a good idea to have characters downgraded, because the meaning can change.<br />
    In German, for example, "ß" and "ss" are sometimes not interchangeable.</p>

    <p>Thomas<br />
    I18N Team Leader</p>
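
    <p>The loss of meaning mentioned above can be shown in a few lines. This is only an illustration with a one-entry table chosen for the example; the word pair "in Maßen" (in moderation) and "in Massen" (in masses) is an assumption added here, not taken from the comment.</p>

```php
<?php
// Illustrative one-entry table: the usual ASCII fallback for the German sharp s.
$table = array('ß' => 'ss');

$a = strtr('in Maßen', $table);  // "in moderation"
$b = strtr('in Massen', $table); // "in masses"

// Two different phrases collapse to the same ASCII string,
// so the transliteration cannot be reversed.
var_dump($a === $b); // bool(true)
```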

    1. Feb 19, 2008

      <p>Hi,</p>

      <p>I originally wanted to create a simple filter that converts a string (e.g. an article title) to an SEO-friendly URL.<br />
      I needed to convert special Czech characters such as ěščřžýáíéůú into their ASCII equivalents.</p>

      <p>It was originally Zend_Filter_SeoUrl and then <ac:link><ri:page ri:content-title="Zend_Filter_Sanitize - Martin Hujer" /></ac:link>.</p>

      <p>Finally, I stripped out the transliteration part into this proposal. Somebody on a Czech board suggested Cyrillic support, so here it is.</p>

      <p>I really don't know how to solve this. I think a component for SEO URLs would be useful, but creating the whole UTF-8 -> ASCII transliteration table is really hard.<br />
      Maybe I should split out the CyrillicToAscii part...</p>

      <p>If somebody has an idea how to solve this, I'll be happy.</p>
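
      <p>The SEO URL use case described above could look roughly like the following sketch. The function name sketch_seo_slug and the small Czech table are hypothetical and are not part of the proposal's code.</p>

```php
<?php
// Hypothetical helper; the name and the small table are assumptions.
function sketch_seo_slug($title)
{
    // A few Czech letters only; characters missing here end up as hyphens.
    $table = array(
        'ě' => 'e', 'š' => 's', 'č' => 'c', 'ř' => 'r', 'ž' => 'z',
        'ý' => 'y', 'á' => 'a', 'í' => 'i', 'é' => 'e', 'ů' => 'u', 'ú' => 'u',
        'Ě' => 'E', 'Š' => 'S', 'Č' => 'C', 'Ř' => 'R', 'Ž' => 'Z',
    );

    // Step 1: transliterate, then lowercase the (now mostly ASCII) result.
    $ascii = strtolower(strtr($title, $table));

    // Step 2: collapse every run of characters other than a-z and 0-9
    // into a single hyphen, then drop leading and trailing hyphens.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $ascii);

    return trim($slug, '-');
}

echo sketch_seo_slug('Článek o češtině!'), "\n"; // prints "clanek-o-cestine"
```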

  4. Jul 22, 2008

    <p>Please consider including a transliteration table for your northern neighbours <ac:emoticon ac:name="wink" /></p>

    <ac:macro ac:name="code"><ac:default-parameter>php</ac:default-parameter><ac:plain-text-body><![CDATA[
    private function _transliteratePolish ($s)
    {
        $table = array(
            'ą' => 'a', 'ę' => 'e', 'ó' => 'o', 'ć' => 'c', 'ł' => 'l',
            'ń' => 'n', 'ś' => 's', 'ź' => 'z', 'ż' => 'z',
            'Ó' => 'O', 'Ć' => 'C', 'Ł' => 'L', 'Ś' => 'S', 'Ź' => 'Z', 'Ż' => 'Z'
        );
        return strtr($s, $table);
    }

    public function testTransPolish ()
    {
        $this->assertEquals("aoclesnzz", $this->_filter->filter("ąóćłęśńźż"));
    }

    public function testTransPolishSentence ()
    {
        $polish = "Pchnąć w tę łódź jeża lub ośm skrzyń fig.";
        $converted = "Pchnac w te lodz jeza lub osm skrzyn fig.";
        $this->assertEquals($converted, $this->_filter->filter($polish));
    }

    ]]></ac:plain-text-body></ac:macro>

    1. Jul 22, 2008

      <p>Thank you. I've updated the code and tests. <ac:emoticon ac:name="smile" /></p>

      1. Jul 22, 2008

        <p>You are so damn quick.</p>

        <p>One more thing I can contribute. These are the traditional, well-established transliteration rules for Danish. (I studied Danish philology once.)</p>

        <ac:macro ac:name="code"><ac:default-parameter>php</ac:default-parameter><ac:plain-text-body><![CDATA[
        private function _transliterateDanish ($s)
        {
            $table = array(
                'æ' => 'ae', 'ø' => 'oe', 'å' => 'aa',
                'Æ' => 'Ae', 'Ø' => 'Oe', 'Å' => 'Aa'
            );
            return strtr($s, $table);
        }

        public function testTransDanish ()
        {
            $this->assertEquals("Aaaeaaoe", $this->_filter->filter("Åæåø"));
        }

        public function testTransDanishSentence ()
        {
            $danish = "På Falster, i nærheden af Nykøbing.";
            $converted = "Paa Falster, i naerheden af Nykoebing.";
            $this->assertEquals($converted, $this->_filter->filter($danish));
        }

        ]]></ac:plain-text-body></ac:macro>

        1. Jul 22, 2008

          <p>Added. Thanks <ac:emoticon ac:name="smile" /></p>

  5. Jul 24, 2008

    <p>Hi,</p>

    <p>here are transliteration rules for Croatian. The same rules are used for Serbian (Latin) and Slovenian (although the Slovenian language doesn't have the character '?' ('?')), and maybe for some others as well.</p>

    <p>Some of the characters already exist in other alphabets. </p>

    <ac:macro ac:name="code"><ac:parameter ac:name="type">php</ac:parameter><ac:plain-text-body><![CDATA[
    /**
     * Transliterate Croatian chars
     *
     * @param string $s
     * @return string
     */
    private function _transliterateCroatian ($s)
    {
        $table = array (
            '?' => 'C', '?' => 'C', 'Ž' => 'Z', 'Š' => 'S', '?' => 'D',
            '?' => 'c', '?' => 'c', 'ž' => 'z', 'š' => 's', '?' => 'd',
        );
        return strtr($s, $table);
    }

    public function testCroatian ()
    {
        $this->assertEquals("cczsdCCZSD", $this->_filter->filter("??žš???ŽŠ?"));
    }

    ]]></ac:plain-text-body></ac:macro>

    1. Jul 24, 2008

      <p>Hi,</p>

      <p>can you please mail it to mhujer <ac:link><ri:page ri:content-title="at" /></ac:link> gmail <ac:link><ri:page ri:content-title="dot" /></ac:link> com as an attachment? Some special chars are displayed here as question marks.</p>

      <p>Thanks! <ac:emoticon ac:name="smile" /></p>

      1. Jul 28, 2008

        <p>Thanks. I have added it.</p>

  6. Jul 28, 2008

    <p>There is an interesting post on <a class="external-link" href="http://www.sitepoint.com/blogs/2006/03/03/us-ascii-transliterations-of-unicode-text/">http://www.sitepoint.com/blogs/2006/03/03/us-ascii-transliterations-of-unicode-text/</a> on the subject of transliteration.</p>

    <p>My main reason for using this ZF component would be to make sure that titles of the articles are transliterated to URIs so I get valid, and readable URIs.</p>

    <p>Did you have a look at the <a class="external-link" href="http://derickrethans.nl/translit.php">http://derickrethans.nl/translit.php</a> library mentioned in the article's comments?</p>

  7. Jul 29, 2008

    <p>There is a recent (well, a few years recent <ac:emoticon ac:name="wink" />) library for transliteration called UTF8 To ASCII (utf8_to_ascii on Google). It's a simple concept: it parses an incoming UTF-8 string, breaks it into UTF-8 characters in hex form, and grabs a sensible hand-picked transliteration from a set of static transliteration tables (just a bunch of PHP arrays). The translit tables are common data (I think Perl has the same set) and the remaining source code, while complex, isn't long. I think it's a really great approach to research, and the existing data will aid your collection.</p>

    <p>I do think Thomas' suggestion is probably not a good one - you should focus only on UTF-8 to ASCII. The reverse is almost impossible unless the UTF-8 to ASCII conversion follows a very specific algorithm (the most common being the one implemented for IDNs); otherwise, how could you even tokenise it sensibly? The point, after all, is to create sensible, human-readable, and agreeable ASCII alternatives for UTF-8 characters for presentation in a URI. If you pardon me for summarising... <ac:emoticon ac:name="wink" /></p>

    <p>As for translit/iconv etc. - mixing extensions would lead to significant divergence. No two transliteration strategies are identical, which is why I strongly prefer a native solution that is cross-compatible anywhere without needing any specific optional extension, unless it's really common.</p>
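
    <p>The table-driven approach described in this comment might be sketched as follows. This is not the utf8_to_ascii library's code: the function name sketch_utf8_to_ascii, the '?' fallback and the three-entry table are assumptions, and the tables are keyed here by the hex form of each character's UTF-8 bytes purely for illustration.</p>

```php
<?php
// Hypothetical sketch of a table-driven lookup in pure PHP (no extensions
// beyond PCRE). Not the utf8_to_ascii library's actual code.
function sketch_utf8_to_ascii($string, array $table, $unknown = '?')
{
    // The /u modifier makes PCRE split on UTF-8 characters rather than bytes.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return false; // input was not valid UTF-8
    }

    $out = '';
    foreach ($chars as $char) {
        if (strlen($char) === 1) {
            $out .= $char; // a single byte in valid UTF-8 is already ASCII
            continue;
        }
        $key = bin2hex($char); // e.g. "æ" becomes "c3a6"
        $out .= isset($table[$key]) ? $table[$key] : $unknown;
    }
    return $out;
}

// Illustrative entries only, keyed by the hex form of the UTF-8 bytes.
$table = array(
    'c3a6' => 'ae', // æ
    'c3b8' => 'oe', // ø
    'c3a5' => 'aa', // å
);

echo sketch_utf8_to_ascii('På Falster, i nærheden af Nykøbing.', $table), "\n";
// prints "Paa Falster, i naerheden af Nykoebing."
```

    <p>Unlike a plain strtr() call, this variant notices characters that have no table entry and substitutes a placeholder, which makes gaps in coverage visible.</p>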

  8. Aug 05, 2008

    <ac:macro ac:name="note"><ac:parameter ac:name="title">Zend Comments</ac:parameter><ac:rich-text-body>
    <p>It looks really useful to have such functionality in ZF.<br />
    The API is also clear.</p>

    <p>The problem is the implementation, which doesn't cover all existing languages. It's not good to provide a component that works with some inputs and doesn't work with others.</p>

    <p>Covering full Unicode range:</p>
    <ul>
    <li>needs complete transliteration info;</li>
    <li>may cause performance problems;</li>
    </ul>

    <p>On the other hand, the <a href="http://pecl.php.net/package/intl">intl PECL extension</a> is based on the <a href="http://www.icu-project.org/">ICU library</a>, which has very comprehensive transliteration functionality.<br />
    We hope this functionality will be included in intl.</p>

    <p>So the proposal is moved to the archived section for as long as we have neither full (or at least essential) Unicode range coverage nor performance data for a pure PHP implementation,<br />
    or until an intl extension with transliteration functionality is included in PHP.</p></ac:rich-text-body></ac:macro>

    1. Aug 05, 2008

      <p>I just integrated CLDR 1.6.1 for the I18N core in ZF.<br />
      I saw that it includes character downgrading information.</p>

      <p>I will have to review it but maybe Zend_Locale is able to give the needed information soon.</p>