ZF-1248: Zend_Filter_Alpha to support filtering characters from given locale


Zend_Filter_Alpha could become a locale-aware class.

My idea is as follows:

$locale = new Zend_Locale('ja_JP');
$alpha  = new Zend_Filter_Alpha(); // optionally takes $locale as input

echo $alpha->filter('フィルタチェイン some other characters not in this locale'); // echoes フィルタチェイン

echo $alpha->filter('フィルタチェイン some english characters'); // echoes someenglishcharacters


Affects version 0.9.2

One thing to note: anyone who wants the default behaviour, based on the POSIX character class [:alpha:], can get it by not providing a locale. E.g.:

$alpha  = new Zend_Filter_Alpha();
echo $alpha->filter('some string');
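For reference, the default behaviour can be approximated with a single regex. This is only a sketch: the function name sketch_filter_alpha_default() is made up for illustration, and I'm assuming the existing filter does something close to this rather than quoting its source.

```php
<?php
// Hypothetical stand-in for the default, locale-less behaviour.
// Assumption: the filter strips every byte that [:alpha:] does not match.
function sketch_filter_alpha_default($value)
{
    // No /u modifier, so the subject is treated byte by byte; which bytes
    // above 0x7F count as alphabetic depends on the PCRE character tables.
    return preg_replace('/[^[:alpha:]]/', '', (string) $value);
}
```

Calling sketch_filter_alpha_default('some string 123') drops the spaces and digits and keeps only the ASCII letters.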

I'm in favor of adding something like this, provided it is a new filter, and does not replace the existing {{Zend_Filter_Alpha}} filter.

http://unicode.org/reports/tr35/tr35-2.html - {{Zend_Filter_ExemplarCharacters()}}

{quote}The <characters> element provides optional information about characters that are in common use in the locale, and information that can be helpful in picking among character encodings that are typically used to transmit data in the language of the locale. It typically only occurs in a language locale, not in a language/territory locale.


This element indicates that normal usage of the language of this locale requires these letters. ("Letter" is interpreted broadly, as in the Unicode General Category, and also includes syllabaries and ideographs.) An encoding that cannot encompass at least these letters is inappropriate for encoding data in the language of this locale. The list of characters is in the Unicode Set format, which allows boolean combinations of sets of letters, including those specified by Unicode properties.

The list should not include non-letters: punctuation marks, digits, etc., although this policy may be changed in future versions of this standard.

The letters do not necessarily form a complete set (especially for languages using large character sets, such as CJK). Nor does the list necessarily include letters that are used in common foreign words used in that language. The letters are only the lowercase alternatives, but implicitly include the normal "case-closure": all uppercase and titlecase variants. For the special case of Turkish, the dotted capital I should be included. Sequences of characters that are considered to be a single letter in the alphabet, such as "ch" can be included, using curly braces (e.g., [[a-z{ch}{ll}{rr}] - [w]]). For more information, see [Data Formats].{quote}

Thus, use of the CLDR data to filter UTF-8 strings is not entirely synonymous with filtering alphabetic characters common to a locale.
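To make the difference concrete, here is a rough sketch of filtering against an exemplar set. The function names and the two hard-coded sets are assumptions for illustration only; they are not the actual CLDR data, and a real filter would read the sets from the locale data instead.

```php
<?php
// Hypothetical, hand-written exemplar sets keyed by language.
// These are illustrative only and NOT copied from CLDR.
function sketch_exemplar_sets()
{
    return array(
        'de' => 'a-zäöüß',
        'fr' => 'a-zàâæçéèêëîïôœùûüÿ',
    );
}

// Hypothetical filter: keeps only characters in the language's exemplar set.
function sketch_filter_exemplar($value, $language)
{
    $sets = sketch_exemplar_sets();
    if (!isset($sets[$language])) {
        throw new InvalidArgumentException("No exemplar set for '$language'");
    }
    // /u treats pattern and subject as UTF-8; /i approximates the
    // "case-closure" mentioned in the quoted text.
    return preg_replace('/[^' . $sets[$language] . ']/iu', '', (string) $value);
}
```

With these sets, a letter that is perfectly alphabetic (é, for instance) survives the 'fr' filter but is removed by the 'de' one, which is exactly the gap between "exemplar characters" and "alphabetic characters".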

@gavin: The related data is already available through Zend_Locale->getTranslation() and ->getTranslationList() ;-)

There is working code available at http://andries.systray.be/Alpha.phps

The way I look at it, a locale-based filter won't do much good. In all fairness it would be better than the current solution, but it's still missing one important point: Why would it be illegal for an American to use French accents? Why can't someone from Japan use Latin characters?

I don't know about you, but I think this is not the perfect solution. I'm aware that a filter that works for all locales would be rather slow, but at least it would save users of sites that are built upon the ZF the headache when they're trying to use a foreign word which would then screw up the whole text.

That's why I suggest adding another filter which would allow all alpha characters. Developers could then decide what's more important to them: speed or usability.
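Something along these lines is what I have in mind. It's a minimal sketch, not a finished filter: the function name is made up, and it assumes the input is valid UTF-8 and that PCRE was built with Unicode property support.

```php
<?php
// Hypothetical "allow every letter" filter, independent of any locale.
function sketch_filter_all_letters($value)
{
    // \p{L} matches any Unicode letter; /u switches PCRE to UTF-8 mode.
    // Note: preg_replace() returns null if $value is not valid UTF-8.
    return preg_replace('/[^\p{L}]/u', '', (string) $value);
}
```

This keeps accented Latin letters and Japanese kana alike, while still removing digits, whitespace and punctuation.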

I already had a conversation with Andries about this and we figured that it might be faster if we checked for alpha chars byte-wise instead of using a super-regex.
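Roughly, the byte-wise idea could look like the sketch below. Everything here is an assumption for the sake of discussion: the function names are invented, the range table is deliberately tiny (a real one would be generated from the Unicode data), and continuation bytes are not validated.

```php
<?php
// Hypothetical table of code point ranges treated as alphabetic.
// Deliberately incomplete; for illustration only.
function sketch_alpha_ranges()
{
    return array(
        array(0x41, 0x5A),     // A-Z
        array(0x61, 0x7A),     // a-z
        array(0xC0, 0xD6),     // Latin-1 letters, first block
        array(0xD8, 0xF6),     // Latin-1 letters, second block
        array(0xF8, 0xFF),     // Latin-1 letters, third block
        array(0x3041, 0x3096), // Hiragana
        array(0x30A1, 0x30FA), // Katakana
    );
}

// Hypothetical byte-wise filter: decodes UTF-8 by hand, no regex involved.
function sketch_filter_alpha_bytewise($value)
{
    $ranges = sketch_alpha_ranges();
    $out = '';
    $len = strlen($value);
    $i = 0;
    while ($i < $len) {
        $b = ord($value[$i]);
        // Derive the sequence length and initial bits from the lead byte.
        if ($b < 0x80) {
            $n = 1; $cp = $b;
        } elseif (($b & 0xE0) === 0xC0) {
            $n = 2; $cp = $b & 0x1F;
        } elseif (($b & 0xF0) === 0xE0) {
            $n = 3; $cp = $b & 0x0F;
        } elseif (($b & 0xF8) === 0xF0) {
            $n = 4; $cp = $b & 0x07;
        } else {
            $i++;       // stray continuation or invalid lead byte: drop it
            continue;
        }
        if ($i + $n > $len) {
            break;      // truncated sequence at the end of the string
        }
        // Fold in the continuation bytes (not validated in this sketch).
        for ($k = 1; $k < $n; $k++) {
            $cp = ($cp << 6) | (ord($value[$i + $k]) & 0x3F);
        }
        foreach ($ranges as $r) {
            if ($cp >= $r[0] && $cp <= $r[1]) {
                $out .= substr($value, $i, $n); // keep the original bytes
                break;
            }
        }
        $i += $n;
    }
    return $out;
}
```

Whether this actually beats a regex would have to be measured; the sketch only shows that the check itself needs nothing more than a lead-byte decode and a range lookup.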

I could play around with it and see what the speed is like, but first let's see what others have to say about this.

Just wanted to add that my name is probably the best example where the locale-based filter would fail. If I signed up on a ZF-based site my name would be (forever) stored in the database as Andr Hoffmann, which would suck big time.

Another similar issue has been posted, including a solution. This issue has been resolved. Please see [ZF-1483]