Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 0.1.3, 0.1.4, 0.1.5
- Fix Version/s: 1.5.0
- Component/s: Zend_Search_Lucene
- Labels: None
- Fix Version Priority: Should Have
Description
Quoting the documentation: "However, text analyzers and query parser use ctype_alpha() for tokenizing text and queries. ctype_alpha() doesn't support UTF-8 and needs to be replaced by something else in nearest future."
PCRE 5+ (6.2 is bundled in PHP 5.1) supports Unicode general categories, which means you can have a Unicode version of ctype_alpha().
If \p{L} is used in a match pattern, it matches any character that is a letter in one of these scripts:
Arabic, Armenian, Bengali, Bopomofo, Braille, Buginese, Buhid,
Canadian_Aboriginal, Cherokee, Common, Coptic, Cypriot, Cyrillic, Deseret,
Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati,
Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada,
Katakana, Kharoshthi, Khmer, Lao, Latin, Limbu, Linear_B, Malayalam,
Mongolian, Myanmar, New_Tai_Lue, Ogham, Old_Italic, Old_Persian, Oriya,
Osmanya, Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog,
Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,
Ugaritic, Yi.
Calling preg_match("/\p{L}/u", $c) for every character $c of the string to be tokenized would do the job (the u modifier makes PCRE treat the pattern and the subject as UTF-8). It may be a performance hog, but if you experiment with these Unicode features in preg_*, you should be able to come up with something efficient.
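As a rough illustration of the idea, here is a minimal sketch that avoids the per-character loop by letting a single preg_match_all() call extract whole runs of letters. The function names tokenize_letters() and is_unicode_letter() are hypothetical and are not part of Zend_Search_Lucene; the sketch assumes the input is valid UTF-8 and that PCRE was built with Unicode property support.

```php
<?php
// Hypothetical helper (not part of Zend_Search_Lucene): returns every run of
// Unicode letters found in a UTF-8 string, using one regex call for the
// whole string instead of one call per character.
function tokenize_letters($text)
{
    // \p{L}+ matches one or more characters in the Unicode "Letter" category.
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    if (preg_match_all('/\p{L}+/u', $text, $matches) === false) {
        return array(); // preg_match_all() failed, e.g. on invalid UTF-8
    }
    return $matches[0]; // full-pattern matches, in order of appearance
}

// Hypothetical single-character check, a rough Unicode analogue of
// ctype_alpha() applied to exactly one character.
function is_unicode_letter($char)
{
    return preg_match('/^\p{L}$/u', $char) === 1;
}
```

For example, tokenize_letters("Привет, мир 42") would yield the two tokens "Привет" and "мир", while the digits and punctuation are skipped. Whether digits should also count as token characters (for instance by adding \p{N} to the character class) is a design choice this sketch leaves open.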
Issue Links
| This issue is a dependency of: | |
| ZF-550 | Possible Alternative for ctype_alpha() |
Alexander Veremyev wrote: