Zend Framework

Tokenizer does not support UTF-8 - Potential way to fix

Details

  • Type: Bug
  • Status: Resolved
  • Priority: Major
  • Resolution: Fixed
  • Affects Version/s: 0.1.3, 0.1.4, 0.1.5
  • Fix Version/s: 1.5.0
  • Component/s: Zend_Search_Lucene
  • Labels:
    None
  • Fix Version Priority:
    Should Have

Description

Quoting the documentation: "However, text analyzers and query parser use ctype_alpha() for tokenizing text and queries. ctype_alpha() doesn't support UTF-8 and needs to be replaced by something else in nearest future."

PCRE 5+ (6.2 is bundled with PHP 5.1) supports Unicode general categories, which means you can have a Unicode version of ctype_alpha().

If \p{L} is used in a match pattern, it matches any character that Unicode classifies as a letter, whatever its script. The PCRE documentation lists these scripts:

Arabic, Armenian, Bengali, Bopomofo, Braille, Buginese, Buhid,
Canadian_Aboriginal, Cherokee, Common, Coptic, Cypriot, Cyrillic, Deseret,
Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati,
Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada,
Katakana, Kharoshthi, Khmer, Lao, Latin, Limbu, Linear_B, Malayalam,
Mongolian, Myanmar, New_Tai_Lue, Ogham, Old_Italic, Old_Persian, Oriya,
Osmanya, Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog,
Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,
Ugaritic, Yi.

Calling preg_match("/\p{L}/", $c) for every character $c of the string to be tokenized would do the job. It may be a performance hog, but after some experimentation with the Unicode features of the preg_* functions it should be possible to find something efficient.
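One possible way to avoid a preg_match() call per character is to let PCRE scan the whole string once. The sketch below is only an illustration: tokenizeLetters() is a hypothetical helper, not part of Zend_Search_Lucene. It also assumes the input is valid UTF-8 and that the 'u' modifier is set so PCRE treats the pattern and subject as UTF-8.

```php
<?php
// Hypothetical helper for illustration; not part of Zend_Search_Lucene.
function tokenizeLetters($text)
{
    // \p{L}+ matches a run of one or more Unicode letters.
    // The 'u' modifier makes PCRE treat pattern and subject as UTF-8.
    $found = preg_match_all('/\p{L}+/u', $text, $matches);

    if ($found === false) {
        // preg_match_all() returns false on failure,
        // for example when the subject is not valid UTF-8.
        return array();
    }

    // $matches[0] holds every full match, in order of appearance.
    return $matches[0];
}
```

For example, tokenizeLetters('Hello, мир! 42') would return the two tokens 'Hello' and 'мир'. Digits and punctuation are skipped because they are not in the letter category.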

http://www.pcre.org/pcre.txt

Activity

Gavin added a comment -

Alexander Veremyev wrote:

Vincent offers an interesting idea in ZF-249.

I have tested it a little with Latin (including umlauts) and Cyrillic letters.
It works properly on Linux but has some problems under Windows.

Alexander Veremyev added a comment -

I've tested the proposed solution under Linux and Windows. It doesn't work correctly in all cases.

The issue is closed with the current encoding management support and the new UTF-8 analyzer.

Alexander Veremyev added a comment -

preg_match() works as expected if the 'u' modifier is set at the end of the pattern.

$c = 'ù';
echo preg_match('/\p{L}/u', $c); // 1

$c = 'ы';
echo preg_match('/\p{L}/u', $c); // 1

$c = '!';
echo preg_match('/\p{L}/u', $c); // 0

According to the PCRE specification, the '/[\p{L}\p{N}]/u' pattern has to be used to match alphanumeric characters.

It's definitely a way to implement a complete UTF-8 analyzer.
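As a rough illustration of how the alphanumeric pattern could drive an analyzer, the sketch below collects each token together with its position. tokenizeAlnum() and the 'text'/'offset' array keys are hypothetical names chosen for this example, not the actual Zend_Search_Lucene analyzer API. Note that PREG_OFFSET_CAPTURE reports byte offsets, not character offsets.

```php
<?php
// Hypothetical helper for illustration; not the Zend_Search_Lucene API.
function tokenizeAlnum($text)
{
    $tokens = array();

    // [\p{L}\p{N}]+ matches a run of Unicode letters and numbers.
    // PREG_OFFSET_CAPTURE adds the byte offset of every match.
    $found = preg_match_all(
        '/[\p{L}\p{N}]+/u',
        $text,
        $matches,
        PREG_OFFSET_CAPTURE
    );

    if ($found === false) {
        // Failure, for example a subject that is not valid UTF-8.
        return $tokens;
    }

    foreach ($matches[0] as $match) {
        // Each $match is array(matched string, byte offset).
        $tokens[] = array('text' => $match[0], 'offset' => $match[1]);
    }

    return $tokens;
}
```

With the input 'PHP5 и UTF8', this yields the tokens 'PHP5' at offset 0, 'и' at offset 5 and 'UTF8' at offset 8. The Cyrillic letter occupies two bytes, which is why the last offset is 8 rather than 7.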

Alexander Veremyev added a comment -

Done.


Dates

  • Created:
    Updated:
    Resolved:

Time Tracking

Estimated: 4h
Remaining: 0m
Logged: 1d