ZF-249: Tokenizer does not support UTF-8 - Potential way to fix
Description
Quoting the documentation: "However, text analyzers and query parser use ctype_alpha() for tokenizing text and queries. ctype_alpha() doesn't support UTF-8 and needs to be replaced by something else in nearest future."
PCRE 5+ (6.2 is bundled in PHP 5.1) supports Unicode general categories, which means you can have a Unicode version of ctype_alpha().
If \p{L} is provided in a match pattern, it matches any character that Unicode classifies as a letter, whatever its script. The scripts named in the PCRE documentation are:
Arabic, Armenian, Bengali, Bopomofo, Braille, Buginese, Buhid,
Canadian_Aboriginal, Cherokee, Common, Coptic, Cypriot, Cyrillic, Deseret,
Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati,
Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada,
Katakana, Kharoshthi, Khmer, Lao, Latin, Limbu, Linear_B, Malayalam,
Mongolian, Myanmar, New_Tai_Lue, Ogham, Old_Italic, Old_Persian, Oriya,
Osmanya, Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog,
Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,
Ugaritic, Yi.
Calling preg_match("/\p{L}/", $c) for every character $c of the string to be tokenized would do the job. It may be a performance hog, but if you play with these Unicode properties in preg_*, you should be able to figure out something efficient.
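As a rough illustration of the idea, here is a minimal sketch of a Unicode-aware stand-in for ctype_alpha(), plus a single-pass variant that avoids one preg_match() call per character. The function names utf8_is_alpha() and utf8_letter_tokens() are hypothetical and not part of the framework. The sketch assumes PHP 7 or later, valid UTF-8 input, a PCRE build with Unicode property support, and the 'u' pattern modifier.

```php
<?php
// Hypothetical helper: a Unicode-aware stand-in for ctype_alpha().
// Assumes $text is UTF-8 and PCRE was built with Unicode property support.
function utf8_is_alpha(string $text): bool
{
    // 'u' makes PCRE treat pattern and subject as UTF-8.
    // \A and \z anchor to the very start and end, so a trailing newline fails.
    return preg_match('/\A\p{L}+\z/u', $text) === 1;
}

// Hypothetical helper: collect runs of letters in one pass over the string,
// instead of testing each character separately.
function utf8_letter_tokens(string $text): array
{
    if (preg_match_all('/\p{L}+/u', $text, $matches) === false) {
        return []; // preg_match_all() failed, e.g. on malformed UTF-8
    }
    return $matches[0];
}
```

Matching whole runs with \p{L}+ lets the regex engine do the scanning, which is one possible answer to the performance concern above.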
Comments
Posted by Gavin (gavin) on 2006-11-13T17:03:02.000+0000
Alexander Veremyev wrote: "Interesting idea is offered by Vincent - ZF-249
I have tested it a bit with Latin (including umlauts) and Cyrillic letters. It works properly on Linux and has some problems under Windows."
Posted by Alexander Veremyev (alexander) on 2007-01-24T19:45:47.000+0000
I've tested the offered solution under Linux and Windows. It doesn't work correctly in all cases.
The issue is closed given the current encoding management support and the new UTF-8 analyzer.
Posted by Alexander Veremyev (alexander) on 2007-07-12T00:25:56.000+0000
preg_match() works as expected if the 'u' modifier is set at the end of the pattern.
According to the PCRE specification (www.pcre.org/pcre.txt), the '/[\p{L}\p{N}]/u' pattern has to be used to match alphanumeric characters.
It's definitely a way to implement a complete UTF-8 analyzer.
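A minimal sketch of how that pattern could drive tokenization, assuming PHP 7.1 or later and valid UTF-8 input. The function name utf8_alnum_tokens() and the shape of the returned array are hypothetical and are not the analyzer's actual API.

```php
<?php
// Hypothetical helper: split UTF-8 text into alphanumeric tokens
// using the '[\p{L}\p{N}]' character class with the 'u' modifier.
function utf8_alnum_tokens(string $text): array
{
    $found = preg_match_all(
        '/[\p{L}\p{N}]+/u',
        $text,
        $matches,
        PREG_OFFSET_CAPTURE
    );
    if ($found === false) {
        return []; // preg_match_all() failed, e.g. on malformed UTF-8
    }

    $tokens = [];
    foreach ($matches[0] as [$term, $byteOffset]) {
        // Offsets are byte offsets into $text, not character positions,
        // even when the 'u' modifier is set.
        $tokens[] = ['term' => $term, 'offset' => $byteOffset];
    }
    return $tokens;
}
```

For example, utf8_alnum_tokens('Zend 2 мир') yields the terms 'Zend', '2' and 'мир'. Anything that needs character positions rather than byte positions would have to convert the offsets separately.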
Posted by Alexander Veremyev (alexander) on 2008-01-31T17:23:26.000+0000
Done.