ZF-249: Tokenizer does not support UTF-8 - Potential way to fix


Quoting documentation : "However, text analyzers and query parser use ctype_alpha() for tokenizing text and queries. ctype_alpha() doesn't support UTF-8 and needs to be replaced by something else in nearest future."

PCRE 5+ (6.2 is bundled in PHP 5.1) supports Unicode general categories which means you can have a Unicode version of ctype_alpha.

If \p{L} is provided in a match pattern, it matches a character that is a letter is one of these languages:

   Arabic,  Armenian,  Bengali,  Bopomofo, Braille, Buginese, Buhid, Cana-
   dian_Aboriginal, Cherokee, Common, Coptic, Cypriot, Cyrillic,  Deseret,
   Devanagari,  Ethiopic,  Georgian,  Glagolitic, Gothic, Greek, Gujarati,
   Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana,  Inherited,  Kannada,
   Katakana,  Kharoshthi,  Khmer,  Lao, Latin, Limbu, Linear_B, Malayalam,
   Mongolian, Myanmar, New_Tai_Lue, Ogham, Old_Italic, Old_Persian, Oriya,
   Osmanya,  Runic,  Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tag-
   banwa,  Tai_Le,  Tamil,  Telugu,  Thaana,  Thai,   Tibetan,   Tifinagh,
   Ugaritic, Yi.

preg_match("/\p{L}/", $c) would do the job repeated for every $c character of the string to be tokenized. It may be a performance hog, but if you play with these Unicode modifiers in preg_*, you should be able to figure out something efficient.


Alexander Veremyev wrote: {quote}Interesting idea is offered by Vincent - ZF-249

I have tested it a bit with Latin (including umlauts) and Cyrillic letters. It works properly on Linux and has some problems under Windows. {quote}

I've tested offered solution under Linux/Windows. It doesn't work correct for all cases.

Issue is closed with current encoding management support and new utf-8 anslyzer.

preg_match() works as it expected if 'u' modifier is set at the end of pattern.

$c = 'ù';
echo preg_match('/\p{L}/u', $c); // 1

$c = 'ы';
echo preg_match('/\p{L}/u', $c); // 1

$c = '!';
echo preg_match('/\p{L}/u', $c); // 0

According to the" rel="nofollow">PCRE specification '/[\p{L}\p{N}]/u' patter has to be used to match alpha-numeric characters.

It's definitely a way to implement complete UTF-8 analyzer.