Zend Framework: Zend_Filter_Transliteration Component Proposal
| Proposed Component Name | Zend_Filter_Transliteration |
|---|---|
| Developer Notes | http://framework.zend.com/wiki/display/ZFDEV/Zend_Filter_Transliteration |
| Proposers | Martin Hujer; Alexander Veremyev (Zend Liaison) |
| Revision | 1.0 - 29th January 2008: Initial proposal; 1.1 - 8th April 2008: Proposal update (wiki revision: 7) |
1. Overview
Zend_Filter_Transliteration is a component that transliterates UTF-8 strings into ASCII.
2. References
3. Component Requirements, Constraints, and Acceptance Criteria
- This filter must correctly transliterate any UTF-8 string into its ASCII equivalent
4. Dependencies on Other Framework Components
- Zend_Filter_Interface
5. Theory of Operation
...
6. Milestones / Tasks
- Milestone 1: [DONE] Finalize this proposal
- Milestone 2: [DONE] Working prototype checked into http://zfdev.googlecode.com/svn/trunk/ZendFilterTransliteration/
- Milestone 3: [DONE] Unit tests exist, work, and are checked into SVN.
- Milestone 4: Community and Zend review
7. Class Index
Zend_Filter_Transliteration
8. Use Cases
9. Class Skeletons
I'm not sure if it's a good idea. Transliteration is the exact description of what this filter will do, so in my opinion it should be called that.
OK, the name 'Transliteration' was suggested to me by somebody, and because I'm not a native speaker, I wasn't sure. I have now looked it up in the dictionary and I'm sure it's a good name.
I would suggest that this filter be able to recode in both directions:
- ASCII to UTF-8
- UTF-8 to ASCII
ASCII to UTF-8 is problematic...
And if I use your UTF-8 to ASCII recoding, I would have problems with characters other than Cyrillic.
If you want to create a generic filter (which is what the name suggests), you should also support all UTF-8 characters.
Until then it would be better to name it something like "Zend_Filter_CyrillicToAscii".
I am not sure you are able to support all charsets... how would you recode Chinese or Japanese characters into ASCII?
I am not even sure it's a good idea to downgrade characters, because the meaning can change.
In German, for example, "ß" and "ss" are sometimes not the same.
Thomas
I18N Team Leader
Hi,
I originally wanted to create a simple filter that converts a string (e.g. an article title) to an SEO-friendly URL.
I needed to convert special Czech chars such as ěščřžýáíéůú into their equivalents.
It was originally Zend_Filter_SeoUrl and then Zend_Filter_Sanitize - Martin Hujer.
Finally, I split the transliteration part out into this proposal. Somebody on a Czech board suggested Cyrillic support, so it is included.
I really don't know how to solve this: I think a component for SEO URLs would be useful, but creating the whole UTF-8 -> ASCII transliteration table is really hard.
Maybe I should split out the CyrillicToAscii part...
If somebody has an idea how to solve this, I'll be happy.
Please consider including a transliteration table for your northern neighbours.
Hi,
Here are transliteration rules for Croatian. The same rules are used for Serbian (Latin) and Slovenian (although the Slovenian language doesn't have the character '?' ('?')), and maybe for some others as well.
Some of the characters already exist in other alphabets.
There is an interesting post on http://www.sitepoint.com/blogs/2006/03/03/us-ascii-transliterations-of-unicode-text/ on the subject of transliteration.
My main reason for using this ZF component would be to make sure that titles of the articles are transliterated to URIs so I get valid, and readable URIs.
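The title-to-URI use case mentioned above could look roughly like the sketch below. It is only an illustration: the function name `slug_sketch` is hypothetical, and the small `$map` table (including the choice of `dj` for `đ`) is an assumption standing in for whatever table the component would actually ship.

```php
<?php
/**
 * Turn an article title into a URL slug (hypothetical helper, not part of ZF).
 */
function slug_sketch($title)
{
    // Illustrative, incomplete table; a real component would need far more entries.
    $map = array(
        'č' => 'c', 'ć' => 'c', 'đ' => 'dj', 'š' => 's', 'ž' => 'z',
        'Č' => 'C', 'Ć' => 'C', 'Đ' => 'Dj', 'Š' => 'S', 'Ž' => 'Z',
    );

    // Replace every character that has an entry in the table.
    $ascii = strtr($title, $map);

    // Lowercase the ASCII letters.
    $ascii = strtolower($ascii);

    // Collapse every run of anything other than a-z or 0-9 into one hyphen.
    // Characters missing from the table are dropped this way as well.
    $ascii = preg_replace('/[^a-z0-9]+/', '-', $ascii);

    // Remove hyphens left over at the start or end.
    return trim($ascii, '-');
}
```

With this table, `slug_sketch('Šećer i čaj')` returns `secer-i-caj`. Note that any character without a table entry silently turns into a hyphen, which is exactly the coverage problem discussed in this thread.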
Did you have a look at the http://derickrethans.nl/translit.php library mentioned in the article's comments?
There is a recent (well, a few years recent) library for transliteration called UTF8 To ASCII (utf8_to_ascii on Google). It's a simple concept which parses an incoming UTF-8 string, breaks it into UTF-8 characters in hex form, and grabs a sensible hand-picked transliteration from a set of static transliteration tables (just a bunch of PHP arrays). The translit tables are common data (I think Perl has the same set) and the remaining source code, while complex, isn't long. I think it's a really great approach to research, and the existing data will aid your collection.
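The table-lookup idea described above can be sketched in a few lines of PHP. This is a minimal illustration and not the utf8_to_ascii library's code: the function name `translit_sketch`, the contents of `$table`, and the `?` fallback are all assumptions made for the example, and the table is keyed by character rather than by hex code point to keep it short.

```php
<?php
// Hypothetical table mapping single UTF-8 characters to ASCII replacements.
// The entries are illustrative only; real tables cover far more of Unicode.
$table = array(
    'š' => 's', 'č' => 'c', 'ž' => 'z', 'ř' => 'r', 'ý' => 'y',
    'á' => 'a', 'í' => 'i', 'é' => 'e', 'ú' => 'u', 'ß' => 'ss',
    'д' => 'd', 'ж' => 'zh', 'щ' => 'shch',
);

/**
 * Transliterate a UTF-8 string one character at a time (hypothetical helper).
 */
function translit_sketch($input, array $table, $fallback = '?')
{
    // The "u" modifier makes PCRE split on UTF-8 characters instead of bytes.
    $chars = preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        // preg_split() returns false when the input is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $out = '';
    foreach ($chars as $char) {
        if (strlen($char) === 1) {
            // A one-byte character is already ASCII; copy it unchanged.
            $out .= $char;
        } elseif (isset($table[$char])) {
            // Known character: use the hand-picked replacement.
            $out .= $table[$char];
        } else {
            // Unknown character: emit the fallback instead of guessing.
            $out .= $fallback;
        }
    }
    return $out;
}
```

For example, `translit_sketch('Straße', $table)` gives `Strasse`, while a character missing from the table, such as `日`, comes back as `?`. The visible fallback is a design choice for the sketch: it makes gaps in the table obvious instead of silently dropping characters.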
I do think Thomas' suggestion is probably not a good one - you should focus only on UTF-8 to ASCII. The reverse is almost impossible unless the UTF-8 to ASCII conversion follows a very specific algorithm (the most common being the one implemented for IDNs); otherwise how could you even tokenise it sensibly? The point, after all, is to create sensible, human-readable, and agreeable ASCII alternatives for UTF-8 characters for presentation in a URI. If you pardon me for summarising...
As for translit/iconv etc. - mixing extensions would lead to significant divergence. No two transliteration strategies are identical, which is why I strongly prefer a native solution that is cross-compatible anywhere without needing any specific optional extension, unless it's really common.
| Zend Comments It looks really useful to have such functionality in ZF. The problem is the implementation, which doesn't cover all existing languages. It's not good to provide a component that works with some inputs and doesn't work with others. Covering full Unicode range:
On the other hand, the intl PECL extension is based on the ICU library, which has very comprehensive transliteration functionality. So the proposal is moved to the archived section until we have coverage of the full or essential Unicode range and performance data for a pure PHP implementation. |
I'm worried that some users will confuse Transliteration with Inflector, and use the wrong filter. Perhaps the class name should be something along the lines of Zend_Filter_ReplaceI18nChars or something similar that lay-people will understand?