ZF-249: Tokenizer does not support UTF-8 - Potential way to fix

Issue Type: Bug Created: 2006-07-13T08:24:44.000+0000 Last Updated: 2008-03-21T16:25:40.000+0000 Status: Resolved Fix version(s): - 1.5.0 (17/Mar/08)

Reporter: Vincent Bouret (vbouret) Assignee: Alexander Veremyev (alexander) Tags: - Zend_Search_Lucene

Related issues: - ZF-550



Quoting documentation : "However, text analyzers and query parser use ctype_alpha() for tokenizing text and queries. ctype_alpha() doesn't support UTF-8 and needs to be replaced by something else in nearest future."

PCRE 5+ (6.2 is bundled in PHP 5.1) supports Unicode general categories which means you can have a Unicode version of ctype_alpha.

If \p{L} is provided in a match pattern, it matches a character that is a letter is one of these languages:

   Arabic,  Armenian,  Bengali,  Bopomofo, Braille, Buginese, Buhid, Cana-
   dian_Aboriginal, Cherokee, Common, Coptic, Cypriot, Cyrillic,  Deseret,
   Devanagari,  Ethiopic,  Georgian,  Glagolitic, Gothic, Greek, Gujarati,
   Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana,  Inherited,  Kannada,
   Katakana,  Kharoshthi,  Khmer,  Lao, Latin, Limbu, Linear_B, Malayalam,
   Mongolian, Myanmar, New_Tai_Lue, Ogham, Old_Italic, Old_Persian, Oriya,
   Osmanya,  Runic,  Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tag-
   banwa,  Tai_Le,  Tamil,  Telugu,  Thaana,  Thai,   Tibetan,   Tifinagh,
   Ugaritic, Yi.

preg_match("/\p{L}/", $c) would do the job repeated for every $c character of the string to be tokenized. It may be a performance hog, but if you play with these Unicode modifiers in preg_*, you should be able to figure out something efficient.


Posted by Gavin (gavin) on 2006-11-13T17:03:02.000+0000

Alexander Veremyev wrote: {quote}Interesting idea is offered by Vincent - ZF-249

I have tested it a bit with Latin (including umlauts) and Cyrillic letters. It works properly on Linux and has some problems under Windows. {quote}

Posted by Alexander Veremyev (alexander) on 2007-01-24T19:45:47.000+0000

I've tested offered solution under Linux/Windows. It doesn't work correct for all cases.

Issue is closed with current encoding management support and new utf-8 anslyzer.

Posted by Alexander Veremyev (alexander) on 2007-07-12T00:25:56.000+0000

preg_match() works as it expected if 'u' modifier is set at the end of pattern.

<pre class="highlight">
$c = 'ù';
echo preg_match('/\p{L}/u', $c); // 1

$c = 'ы';
echo preg_match('/\p{L}/u', $c); // 1

$c = '!';
echo preg_match('/\p{L}/u', $c); // 0

According to the" rel="nofollow">PCRE specification '/[\p{L}\p{N}]/u' patter has to be used to match alpha-numeric characters.

It's definitely a way to implement complete UTF-8 analyzer.

Posted by Alexander Veremyev (alexander) on 2008-01-31T17:23:26.000+0000


Have you found an issue?

See the Overview section for more details.


© 2006-2018 by Zend, a Rogue Wave Company. Made with by awesome contributors.

This website is built using zend-expressive and it runs on PHP 7.