ZF-76: Zend_Search_Lucene don't index polish chars in UTF-8 (TRAC#83)

Description

http://framework.zend.com/developer/ticket/83

FILE ATTACHED IN TRAC

Zend_Search_Lucene don't index polish chars in UTF-8. The exception is thrown when it gets for example "ś" char (0x015B). Exception:


Zend_Search_Lucene_Exception::__set_state(array(
   'message' => 'Invalid UTF-8 string',
   'string' => '',
   'code' => 0,
   'file' => 'C:\\Apache2\\htdocs\\ZendFramework-0.1.2\\library\\Zend\\Search\\Lucene\\Storage\\File.php',
   'line' => 354,
   'trace' => 
  array (
    0 => 
    array (
      'file' => 'C:\\Apache2\\htdocs\\ZendFramework-0.1.2\\library\\Zend\\Search\\Lucene\\Index\\SegmentWriter.php',
      'line' => 316,
      'function' => 'writeString',
      'class' => 'Zend_Search_Lucene_Storage_File',
      'type' => '->',
      'args' => 
      array (
        0 => '�',
      ),
    ),
    1 => 
    array (
      'file' => 'C:\\Apache2\\htdocs\\ZendFramework-0.1.2\\library\\Zend\\Search\\Lucene\\Index\\SegmentWriter.php',
      'line' => 417,
      'function' => '_dumpTermDictEntry',
      'class' => 'Zend_Search_Lucene_Index_SegmentWriter',
      'type' => '->',
      'args' => 
      array (
        0 => 
        Zend_Search_Lucene_Storage_File_Filesystem::__set_state(array(
           '_fileHandle' => NULL,
        )),
        1 => 
        Zend_Search_Lucene_Index_Term::__set_state(array(
           'field' => 1,
           'text' => 'tests',
        )),
        2 => 
        Zend_Search_Lucene_Index_Term::__set_state(array(
           'field' => 1,
           'text' => '�',
        )),
        3 => 
        Zend_Search_Lucene_Index_TermInfo::__set_state(array(
           'docFreq' => 1,
           'freqPointer' => 7,
           'proxPointer' => 7,
           'skipOffset' => 0,
           'indexPointer' => NULL,
        )),
        4 => 
        Zend_Search_Lucene_Index_TermInfo::__set_state(array(
           'docFreq' => 1,
           'freqPointer' => 8,
           'proxPointer' => 8,
           'skipOffset' => 0,
           'indexPointer' => NULL,
        )),
      ),
    ),
    2 => 
    array (
      'file' => 'C:\\Apache2\\htdocs\\ZendFramework-0.1.2\\library\\Zend\\Search\\Lucene\\Index\\SegmentWriter.php',
      'line' => 481,
      'function' => '_dumpDictionary',
      'class' => 'Zend_Search_Lucene_Index_SegmentWriter',
      'type' => '->',
      'args' => 
      array (
      ),
    ),
    3 => 
    array (
      'file' => 'C:\\Apache2\\htdocs\\ZendFramework-0.1.2\\library\\Zend\\Search\\Lucene\\Index\\Writer.php',
      'line' => 228,
      'function' => 'close',
      'class' => 'Zend_Search_Lucene_Index_SegmentWriter',
      'type' => '->',
      'args' => 
      array (
      ),
    ),
    4 => 
    array (
      'file' => 'C:\\Apache2\\htdocs\\ZendFramework-0.1.2\\library\\Zend\\Search\\Lucene.php',
      'line' => 507,
      'function' => 'commit',
      'class' => 'Zend_Search_Lucene_Index_Writer',
      'type' => '->',
      'args' => 
      array (
      ),
    ),
    5 => 
    array (
      'file' => 'C:\\Apache2\\htdocs\\zend_framework_tests\\zend_search.php',
      'line' => 27,
      'function' => 'commit',
      'class' => 'Zend_Search_Lucene',
      'type' => '->',
      'args' => 
      array (
      ),
    ),
  ),
))

Fatal error: Exception thrown without a stack frame in Unknown on line 0

Note: both console and data are in UTF-8 encoding.

Comments

05/23/06 06:06:57: Modified by dominik@bulaj.com

* attachment zend_search.php added.

Example file doesn't index its contents by Zend_Search 05/27/06 04:33:56: Modified by epaulin AT gmail DOT com

hi, here is a UTF-8 patch made by Hightman, it does works for me(Text Fileds lenth seems limited, probably it's a ZendSearch? defect).

http://php.twomice.net/xfile2.php/…

I assume this is the issue mentioned in the documentation in section 14.5.1. : "However, text analyzers and query parser use ctype_alpha() for tokenizing text and queries. ctype_alpha() doesn't support UTF-8 and needs to be replaced by something else in nearest future.".

This IMO makes Zend_Search_Lucene currently more or less unusable for languages that go beyond plain ASCII. In German for example umlauts (an 'a' with two dots on top of it) are replaced by 'double-quote vowel' , when you convert it to ASCII//TRANSLIT as proposed as a workaround.

BTW: The URL of the patch mentioned above is currently broken.

hi, Georg Yeah, the URL is broken. The guy who did this patch say "the patch release under open source", but he didn't say which license it is, so i just give the url above. If you instersted at this patch, drop me a mail, epaulin AT gmail DOT com

http://www.blogcityusa.com/blog/desa/687/ - http://www.blogcityusa.com/blog/desa/688/ - http://www.blogcityusa.com/blog/desa/689/ - http://www.blogcityusa.com/blog/desa/690/ - http://www.blogcityusa.com/blog/desa/691/ - http://www.blogcityusa.com/blog/desa/692/ - http://www.blogcityusa.com/blog/desa/693/ - http://www.blogcityusa.com/blog/desa/694/ - http://www.blogcityusa.com/blog/desa/695/ - http://www.blogcityusa.com/blog/desa/696/ - http://www.blogcityusa.com/blog/desa/697/ - http://www.blogcityusa.com/blog/desa/698/ - http://www.blogcityusa.com/blog/desa/699/ - http://www.blogcityusa.com/blog/desa/700/ - http://www.blogcityusa.com/blog/desa/701/ - http://www.blogcityusa.com/blog/desa/702/ - http://www.blogcityusa.com/blog/desa/703/ - http://www.blogcityusa.com/blog/desa/704/ - http://www.blogcityusa.com/blog/desa/705/ - http://www.blogcityusa.com/blog/desa/706/ - http://www.blogcityusa.com/blog/desa/707/ - http://www.blogcityusa.com/blog/desa/708/ - http://www.blogcityusa.com/blog/desa/709/ - http://www.blogcityusa.com/blog/desa/710/ - http://www.blogcityusa.com/blog/desa/711/ - http://www.blogcityusa.com/blog/desa/712/ - http://www.blogcityusa.com/blog/desa/713/ - http://www.blogcityusa.com/blog/desa/714/ - http://www.blogcityusa.com/blog/desa/715/ - http://www.blogcityusa.com/blog/desa/716/ - http://www.blogcityusa.com/blog/desa/717/ - http://www.blogcityusa.com/blog/desa/718/ - http://www.blogcityusa.com/blog/desa/719/ - http://www.blogcityusa.com/blog/desa/720/ - http://www.blogcityusa.com/blog/desa/721/ - http://www.blogcityusa.com/blog/desa/722/ - http://www.blogcityusa.com/blog/desa/723/ - http://www.blogcityusa.com/blog/desa/724/ - http://www.blogcityusa.com/blog/desa/725/ - http://www.blogcityusa.com/blog/desa/726/ - http://www.blogcityusa.com/blog/desa/727/ - http://www.blogcityusa.com/blog/desa/728/ - http://www.blogcityusa.com/blog/desa/729/ - http://www.blogcityusa.com/blog/desa/730/ - http://www.blogcityusa.com/blog/desa/731/ - http://www.blogcityusa.com/blog/desa/732/ - http://www.blogcityusa.com/blog/desa/733/ - http://www.blogcityusa.com/blog/desa/734/ - http://www.blogcityusa.com/blog/desa/735/ - http://www.blogcityusa.com/blog/desa/736/ - http://www.blogcityusa.com/blog/desa/737/ - http://www.blogcityusa.com/blog/desa/738/ - http://www.blogcityusa.com/blog/desa/739/ - http://www.blogcityusa.com/blog/desa/740/ - http://www.blogcityusa.com/blog/desa/741/ - http://www.blogcityusa.com/blog/desa/742/ - http://www.blogcityusa.com/blog/desa/743/ - http://www.blogcityusa.com/blog/desa/744/ - http://www.blogcityusa.com/blog/desa/745/ - http://www.blogcityusa.com/blog/desa/746/ - http://www.blogcityusa.com/blog/desa/747/ - http://www.blogcityusa.com/blog/desa/748/ - http://www.blogcityusa.com/blog/desa/749/ - http://www.blogcityusa.com/blog/desa/750/ - http://www.blogcityusa.com/blog/desa/751/ - http://www.blogcityusa.com/blog/desa/752/ - http://www.blogcityusa.com/blog/desa/753/ - http://www.blogcityusa.com/blog/desa/754/ - http://www.blogcityusa.com/blog/desa/755/ - http://www.blogcityusa.com/blog/desa/756/ - http://www.blogcityusa.com/blog/desa/757/ - http://www.blogcityusa.com/blog/desa/758/ - http://www.blogcityusa.com/blog/desa/759/ - http://www.blogcityusa.com/blog/desa/760/ - http://www.blogcityusa.com/blog/desa/761/ - http://www.blogcityusa.com/blog/desa/762/ - http://www.blogcityusa.com/blog/desa/763/ - http://www.blogcityusa.com/blog/desa/764/ - http://www.blogcityusa.com/blog/desa/765/ - http://www.blogcityusa.com/blog/desa/766/ - http://www.blogcityusa.com/blog/desa/767/ - http://www.blogcityusa.com/blog/desa/768/ - http://www.blogcityusa.com/blog/desa/769/ - http://www.blogcityusa.com/blog/desa/770/ - http://www.blogcityusa.com/blog/desa/771/ - http://www.blogcityusa.com/blog/desa/772/ - http://www.blogcityusa.com/blog/desa/773/ - http://www.blogcityusa.com/blog/desa/774/ - http://www.blogcityusa.com/blog/desa/775/ - http://www.blogcityusa.com/blog/desa/776/ - http://www.blogcityusa.com/blog/desa/777/ - http://www.blogcityusa.com/blog/desa/778/ - http://www.blogcityusa.com/blog/desa/779/ - http://www.blogcityusa.com/blog/desa/780/ - http://www.blogcityusa.com/blog/desa/781/ - http://www.blogcityusa.com/blog/desa/782/ - http://www.blogcityusa.com/blog/desa/783/ - http://www.blogcityusa.com/blog/desa/784/ - http://www.blogcityusa.com/blog/desa/785/ - http://www.blogcityusa.com/blog/desa/786/ - http://www.blogcityusa.com/blog/desa/787/ - http://www.blogcityusa.com/blog/desa/788/ - http://www.blogcityusa.com/blog/desa/789/ - http://www.blogcityusa.com/blog/desa/790/ - http://www.blogcityusa.com/blog/desa/791/ - http://www.blogcityusa.com/blog/desa/792/ - http://www.blogcityusa.com/blog/desa/793/ - http://www.blogcityusa.com/blog/desa/794/ - http://www.blogcityusa.com/blog/desa/795/ - http://www.blogcityusa.com/blog/desa/796/ - http://www.blogcityusa.com/blog/desa/797/ - http://www.blogcityusa.com/blog/desa/798/ - http://www.blogcityusa.com/blog/desa/799/ - http://www.blogcityusa.com/blog/desa/800/ - http://www.blogcityusa.com/blog/desa/801/ - http://www.blogcityusa.com/blog/desa/802/ - http://www.blogcityusa.com/blog/desa/803/ - http://www.blogcityusa.com/blog/desa/804/ - http://www.blogcityusa.com/blog/desa/805/ - http://www.blogcityusa.com/blog/desa/806/ - http://www.blogcityusa.com/blog/desa/807/ - http://www.blogcityusa.com/blog/desa/808/ - http://www.blogcityusa.com/blog/desa/809/ - http://www.blogcityusa.com/blog/desa/810/ - http://www.blogcityusa.com/blog/desa/811/ - http://www.blogcityusa.com/blog/desa/812/ - http://www.blogcityusa.com/blog/desa/813/ - http://www.blogcityusa.com/blog/desa/814/ - http://www.blogcityusa.com/blog/desa/815/ - http://www.blogcityusa.com/blog/desa/816/ - http://www.blogcityusa.com/blog/desa/817/ - http://www.blogcityusa.com/blog/desa/818/ - http://www.blogcityusa.com/blog/desa/819/ - http://www.blogcityusa.com/blog/desa/820/ - http://www.blogcityusa.com/blog/desa/821/ - http://www.blogcityusa.com/blog/desa/822/ - http://www.blogcityusa.com/blog/desa/823/ - http://www.blogcityusa.com/blog/desa/824/ - http://www.blogcityusa.com/blog/desa/825/ - http://www.blogcityusa.com/blog/desa/826/ - http://www.blogcityusa.com/blog/desa/827/ - http://www.blogcityusa.com/blog/desa/828/ - http://www.blogcityusa.com/blog/desa/829/ - http://www.blogcityusa.com/blog/desa/830/ - http://www.blogcityusa.com/blog/desa/831/ - http://www.blogcityusa.com/blog/desa/832/ - http://www.blogcityusa.com/blog/desa/833/ - http://www.blogcityusa.com/blog/desa/834/ - http://www.blogcityusa.com/blog/desa/835/ - http://www.blogcityusa.com/blog/desa/836/ - http://www.blogcityusa.com/blog/desa/837/ - http://www.blogcityusa.com/blog/desa/838/ - http://www.blogcityusa.com/blog/desa/839/ - http://www.blogcityusa.com/blog/desa/840/ - http://www.blogcityusa.com/blog/desa/841/ - http://www.blogcityusa.com/blog/desa/842/ - http://www.blogcityusa.com/blog/desa/843/ - http://www.blogcityusa.com/blog/desa/844/ - http://www.blogcityusa.com/blog/desa/845/ - http://www.blogcityusa.com/blog/desa/846/ - http://www.blogcityusa.com/blog/desa/847/ - http://www.blogcityusa.com/blog/desa/848/ - http://www.blogcityusa.com/blog/desa/849/ - http://www.blogcityusa.com/blog/desa/850/ - http://www.blogcityusa.com/blog/desa/851/ - http://www.blogcityusa.com/blog/desa/852/ - http://www.blogcityusa.com/blog/desa/853/ - http://www.blogcityusa.com/blog/desa/854/ - http://www.blogcityusa.com/blog/desa/855/ - http://www.blogcityusa.com/blog/desa/856/ - http://www.blogcityusa.com/blog/desa/857/ - http://www.blogcityusa.com/blog/desa/858/ - http://www.blogcityusa.com/blog/desa/859/ - http://www.blogcityusa.com/blog/desa/860/ - http://www.blogcityusa.com/blog/desa/861/ - http://www.blogcityusa.com/blog/desa/862/ - http://www.blogcityusa.com/blog/desa/863/ - http://www.blogcityusa.com/blog/desa/864/ - http://www.blogcityusa.com/blog/desa/865/ - http://www.blogcityusa.com/blog/desa/866/ - http://www.blogcityusa.com/blog/desa/867/ - http://www.blogcityusa.com/blog/desa/868/ - http://www.blogcityusa.com/blog/desa/869/ - http://www.blogcityusa.com/blog/desa/870/ - http://www.blogcityusa.com/blog/desa/871/ - http://www.blogcityusa.com/blog/desa/872/ - http://www.blogcityusa.com/blog/desa/873/ - http://www.blogcityusa.com/blog/desa/874/ - http://www.blogcityusa.com/blog/desa/875/ - http://www.blogcityusa.com/blog/desa/876/ - http://www.blogcityusa.com/blog/desa/877/ - http://www.blogcityusa.com/blog/desa/878/ - http://www.blogcityusa.com/blog/desa/879/ - http://www.blogcityusa.com/blog/desa/880/ - http://www.blogcityusa.com/blog/desa/881/ - http://www.blogcityusa.com/blog/desa/882/ - http://www.blogcityusa.com/blog/desa/883/ - http://www.blogcityusa.com/blog/desa/884/ - http://www.blogcityusa.com/blog/desa/885/ - http://www.blogcityusa.com/blog/desa/886/

Changing fix version to unknown.

Changing fix version to 0.6.0.

Issue is resolved with current encoding management support and new utf-8 anslyzer.