Issues

ZF-76: Zend_Search_Lucene don't index polish chars in UTF-8 (TRAC#83)

Issue Type: Bug Created: 2006-06-20T23:07:30.000+0000 Last Updated: 2007-07-05T14:43:08.000+0000 Status: Resolved Fix version(s): - 0.8.0 (21/Feb/07)

Reporter: Zend Framework (zend_framework) Assignee: Alexander Veremyev (alexander) Tags: - Zend_Search_Lucene

Related issues: Attachments: - zend_search.php

Description

http://framework.zend.com/developer/ticket/83

FILE ATTACHED IN TRAC

Zend_Search_Lucene don't index polish chars in UTF-8. The exception is thrown when it gets for example "ś" char (0x015B). Exception:

<pre class="highlight">
Zend_Search_Lucene_Exception::__set_state(array(
   'message' => 'Invalid UTF-8 string',
   'string' => '',
   'code' => 0,
   'file' => 'C:\\Apache2\\htdocs\\ZendFramework-0.1.2\\library\\Zend\\Search\\Lucene\\Storage\\File.php',
   'line' => 354,
   'trace' => 
  array (
    0 => 
    array (
      'file' => 'C:\\Apache2\\htdocs\\ZendFramework-0.1.2\\library\\Zend\\Search\\Lucene\\Index\\SegmentWriter.php',
      'line' => 316,
      'function' => 'writeString',
      'class' => 'Zend_Search_Lucene_Storage_File',
      'type' => '->',
      'args' => 
      array (
        0 => '�',
      ),
    ),
    1 => 
    array (
      'file' => 'C:\\Apache2\\htdocs\\ZendFramework-0.1.2\\library\\Zend\\Search\\Lucene\\Index\\SegmentWriter.php',
      'line' => 417,
      'function' => '_dumpTermDictEntry',
      'class' => 'Zend_Search_Lucene_Index_SegmentWriter',
      'type' => '->',
      'args' => 
      array (
        0 => 
        Zend_Search_Lucene_Storage_File_Filesystem::__set_state(array(
           '_fileHandle' => NULL,
        )),
        1 => 
        Zend_Search_Lucene_Index_Term::__set_state(array(
           'field' => 1,
           'text' => 'tests',
        )),
        2 => 
        Zend_Search_Lucene_Index_Term::__set_state(array(
           'field' => 1,
           'text' => '�',
        )),
        3 => 
        Zend_Search_Lucene_Index_TermInfo::__set_state(array(
           'docFreq' => 1,
           'freqPointer' => 7,
           'proxPointer' => 7,
           'skipOffset' => 0,
           'indexPointer' => NULL,
        )),
        4 => 
        Zend_Search_Lucene_Index_TermInfo::__set_state(array(
           'docFreq' => 1,
           'freqPointer' => 8,
           'proxPointer' => 8,
           'skipOffset' => 0,
           'indexPointer' => NULL,
        )),
      ),
    ),
    2 => 
    array (
      'file' => 'C:\\Apache2\\htdocs\\ZendFramework-0.1.2\\library\\Zend\\Search\\Lucene\\Index\\SegmentWriter.php',
      'line' => 481,
      'function' => '_dumpDictionary',
      'class' => 'Zend_Search_Lucene_Index_SegmentWriter',
      'type' => '->',
      'args' => 
      array (
      ),
    ),
    3 => 
    array (
      'file' => 'C:\\Apache2\\htdocs\\ZendFramework-0.1.2\\library\\Zend\\Search\\Lucene\\Index\\Writer.php',
      'line' => 228,
      'function' => 'close',
      'class' => 'Zend_Search_Lucene_Index_SegmentWriter',
      'type' => '->',
      'args' => 
      array (
      ),
    ),
    4 => 
    array (
      'file' => 'C:\\Apache2\\htdocs\\ZendFramework-0.1.2\\library\\Zend\\Search\\Lucene.php',
      'line' => 507,
      'function' => 'commit',
      'class' => 'Zend_Search_Lucene_Index_Writer',
      'type' => '->',
      'args' => 
      array (
      ),
    ),
    5 => 
    array (
      'file' => 'C:\\Apache2\\htdocs\\zend_framework_tests\\zend_search.php',
      'line' => 27,
      'function' => 'commit',
      'class' => 'Zend_Search_Lucene',
      'type' => '->',
      'args' => 
      array (
      ),
    ),
  ),
))

<br></br><b>Fatal error</b>: Exception thrown without a stack frame in <b>Unknown</b> on line <b>0</b><br></br>Note: both console and data are in UTF-8 encoding.

Comments

Posted by Zend Framework (zend_framework) on 2006-06-20T23:07:44.000+0000

05/23/06 06:06:57: Modified by dominik@bulaj.com

* attachment zend_search.php added.

Example file doesn't index its contents by Zend_Search 05/27/06 04:33:56: Modified by epaulin AT gmail DOT com

hi, here is a UTF-8 patch made by Hightman, it does works for me(Text Fileds lenth seems limited, probably it's a ZendSearch? defect).

http://php.twomice.net/xfile2.php/…

Posted by Georg von der Howen (ghowen) on 2006-07-10T01:29:09.000+0000

I assume this is the issue mentioned in the documentation in section 14.5.1. : "However, text analyzers and query parser use ctype_alpha() for tokenizing text and queries. ctype_alpha() doesn't support UTF-8 and needs to be replaced by something else in nearest future.".

This IMO makes Zend_Search_Lucene currently more or less unusable for languages that go beyond plain ASCII. In German for example umlauts (an 'a' with two dots on top of it) are replaced by 'double-quote vowel' , when you convert it to ASCII//TRANSLIT as proposed as a workaround.

BTW: The URL of the patch mentioned above is currently broken.

Posted by HaiMing Yin (epaulin) on 2006-07-10T08:08:52.000+0000

hi, Georg Yeah, the URL is broken. The guy who did this patch say "the patch release under open source", but he didn't say which license it is, so i just give the url above. If you instersted at this patch, drop me a mail, epaulin AT gmail DOT com

Posted by mark lewman (bossy) on 2006-11-01T22:29:21.000+0000

http://www.blogcityusa.com/blog/desa/687/ - http://www.blogcityusa.com/blog/desa/688/ - http://www.blogcityusa.com/blog/desa/689/ - http://www.blogcityusa.com/blog/desa/690/ - http://www.blogcityusa.com/blog/desa/691/ - http://www.blogcityusa.com/blog/desa/692/ - http://www.blogcityusa.com/blog/desa/693/ - http://www.blogcityusa.com/blog/desa/694/ - http://www.blogcityusa.com/blog/desa/695/ - http://www.blogcityusa.com/blog/desa/696/ - http://www.blogcityusa.com/blog/desa/697/ - http://www.blogcityusa.com/blog/desa/698/ - http://www.blogcityusa.com/blog/desa/699/ - http://www.blogcityusa.com/blog/desa/700/ - http://www.blogcityusa.com/blog/desa/701/ - http://www.blogcityusa.com/blog/desa/702/ - http://www.blogcityusa.com/blog/desa/703/ - http://www.blogcityusa.com/blog/desa/704/ - http://www.blogcityusa.com/blog/desa/705/ - http://www.blogcityusa.com/blog/desa/706/ - http://www.blogcityusa.com/blog/desa/707/ - http://www.blogcityusa.com/blog/desa/708/ - http://www.blogcityusa.com/blog/desa/709/ - http://www.blogcityusa.com/blog/desa/710/ - http://www.blogcityusa.com/blog/desa/711/ - http://www.blogcityusa.com/blog/desa/712/ - http://www.blogcityusa.com/blog/desa/713/ - http://www.blogcityusa.com/blog/desa/714/ - http://www.blogcityusa.com/blog/desa/715/ - http://www.blogcityusa.com/blog/desa/716/ - http://www.blogcityusa.com/blog/desa/717/ - http://www.blogcityusa.com/blog/desa/718/ - http://www.blogcityusa.com/blog/desa/719/ - http://www.blogcityusa.com/blog/desa/720/ - http://www.blogcityusa.com/blog/desa/721/ - http://www.blogcityusa.com/blog/desa/722/ - http://www.blogcityusa.com/blog/desa/723/ - http://www.blogcityusa.com/blog/desa/724/ - http://www.blogcityusa.com/blog/desa/725/ - http://www.blogcityusa.com/blog/desa/726/ - http://www.blogcityusa.com/blog/desa/727/ - http://www.blogcityusa.com/blog/desa/728/ - http://www.blogcityusa.com/blog/desa/729/ - http://www.blogcityusa.com/blog/desa/730/ - http://www.blogcityusa.com/blog/desa/731/ - http://www.blogcityusa.com/blog/desa/732/ - http://www.blogcityusa.com/blog/desa/733/ - http://www.blogcityusa.com/blog/desa/734/ - http://www.blogcityusa.com/blog/desa/735/ - http://www.blogcityusa.com/blog/desa/736/ - http://www.blogcityusa.com/blog/desa/737/ - http://www.blogcityusa.com/blog/desa/738/ - http://www.blogcityusa.com/blog/desa/739/ - http://www.blogcityusa.com/blog/desa/740/ - http://www.blogcityusa.com/blog/desa/741/ - http://www.blogcityusa.com/blog/desa/742/ - http://www.blogcityusa.com/blog/desa/743/ - http://www.blogcityusa.com/blog/desa/744/ - http://www.blogcityusa.com/blog/desa/745/ - http://www.blogcityusa.com/blog/desa/746/ - http://www.blogcityusa.com/blog/desa/747/ - http://www.blogcityusa.com/blog/desa/748/ - http://www.blogcityusa.com/blog/desa/749/ - http://www.blogcityusa.com/blog/desa/750/ - http://www.blogcityusa.com/blog/desa/751/ - http://www.blogcityusa.com/blog/desa/752/ - http://www.blogcityusa.com/blog/desa/753/ - http://www.blogcityusa.com/blog/desa/754/ - http://www.blogcityusa.com/blog/desa/755/ - http://www.blogcityusa.com/blog/desa/756/ - http://www.blogcityusa.com/blog/desa/757/ - http://www.blogcityusa.com/blog/desa/758/ - http://www.blogcityusa.com/blog/desa/759/ - http://www.blogcityusa.com/blog/desa/760/ - http://www.blogcityusa.com/blog/desa/761/ - http://www.blogcityusa.com/blog/desa/762/ - http://www.blogcityusa.com/blog/desa/763/ - http://www.blogcityusa.com/blog/desa/764/ - http://www.blogcityusa.com/blog/desa/765/ - http://www.blogcityusa.com/blog/desa/766/ - http://www.blogcityusa.com/blog/desa/767/ - http://www.blogcityusa.com/blog/desa/768/ - http://www.blogcityusa.com/blog/desa/769/ - http://www.blogcityusa.com/blog/desa/770/ - http://www.blogcityusa.com/blog/desa/771/ - http://www.blogcityusa.com/blog/desa/772/ - http://www.blogcityusa.com/blog/desa/773/ - http://www.blogcityusa.com/blog/desa/774/ - http://www.blogcityusa.com/blog/desa/775/ - http://www.blogcityusa.com/blog/desa/776/ - http://www.blogcityusa.com/blog/desa/777/ - http://www.blogcityusa.com/blog/desa/778/ - http://www.blogcityusa.com/blog/desa/779/ - http://www.blogcityusa.com/blog/desa/780/ - http://www.blogcityusa.com/blog/desa/781/ - http://www.blogcityusa.com/blog/desa/782/ - http://www.blogcityusa.com/blog/desa/783/ - http://www.blogcityusa.com/blog/desa/784/ - http://www.blogcityusa.com/blog/desa/785/ - http://www.blogcityusa.com/blog/desa/786/ - http://www.blogcityusa.com/blog/desa/787/ - http://www.blogcityusa.com/blog/desa/788/ - http://www.blogcityusa.com/blog/desa/789/ - http://www.blogcityusa.com/blog/desa/790/ - http://www.blogcityusa.com/blog/desa/791/ - http://www.blogcityusa.com/blog/desa/792/ - http://www.blogcityusa.com/blog/desa/793/ - http://www.blogcityusa.com/blog/desa/794/ - http://www.blogcityusa.com/blog/desa/795/ - http://www.blogcityusa.com/blog/desa/796/ - http://www.blogcityusa.com/blog/desa/797/ - http://www.blogcityusa.com/blog/desa/798/ - http://www.blogcityusa.com/blog/desa/799/ - http://www.blogcityusa.com/blog/desa/800/ - http://www.blogcityusa.com/blog/desa/801/ - http://www.blogcityusa.com/blog/desa/802/ - http://www.blogcityusa.com/blog/desa/803/ - http://www.blogcityusa.com/blog/desa/804/ - http://www.blogcityusa.com/blog/desa/805/ - http://www.blogcityusa.com/blog/desa/806/ - http://www.blogcityusa.com/blog/desa/807/ - http://www.blogcityusa.com/blog/desa/808/ - http://www.blogcityusa.com/blog/desa/809/ - http://www.blogcityusa.com/blog/desa/810/ - http://www.blogcityusa.com/blog/desa/811/ - http://www.blogcityusa.com/blog/desa/812/ - http://www.blogcityusa.com/blog/desa/813/ - http://www.blogcityusa.com/blog/desa/814/ - http://www.blogcityusa.com/blog/desa/815/ - http://www.blogcityusa.com/blog/desa/816/ - http://www.blogcityusa.com/blog/desa/817/ - http://www.blogcityusa.com/blog/desa/818/ - http://www.blogcityusa.com/blog/desa/819/ - http://www.blogcityusa.com/blog/desa/820/ - http://www.blogcityusa.com/blog/desa/821/ - http://www.blogcityusa.com/blog/desa/822/ - http://www.blogcityusa.com/blog/desa/823/ - http://www.blogcityusa.com/blog/desa/824/ - http://www.blogcityusa.com/blog/desa/825/ - http://www.blogcityusa.com/blog/desa/826/ - http://www.blogcityusa.com/blog/desa/827/ - http://www.blogcityusa.com/blog/desa/828/ - http://www.blogcityusa.com/blog/desa/829/ - http://www.blogcityusa.com/blog/desa/830/ - http://www.blogcityusa.com/blog/desa/831/ - http://www.blogcityusa.com/blog/desa/832/ - http://www.blogcityusa.com/blog/desa/833/ - http://www.blogcityusa.com/blog/desa/834/ - http://www.blogcityusa.com/blog/desa/835/ - http://www.blogcityusa.com/blog/desa/836/ - http://www.blogcityusa.com/blog/desa/837/ - http://www.blogcityusa.com/blog/desa/838/ - http://www.blogcityusa.com/blog/desa/839/ - http://www.blogcityusa.com/blog/desa/840/ - http://www.blogcityusa.com/blog/desa/841/ - http://www.blogcityusa.com/blog/desa/842/ - http://www.blogcityusa.com/blog/desa/843/ - http://www.blogcityusa.com/blog/desa/844/ - http://www.blogcityusa.com/blog/desa/845/ - http://www.blogcityusa.com/blog/desa/846/ - http://www.blogcityusa.com/blog/desa/847/ - http://www.blogcityusa.com/blog/desa/848/ - http://www.blogcityusa.com/blog/desa/849/ - http://www.blogcityusa.com/blog/desa/850/ - http://www.blogcityusa.com/blog/desa/851/ - http://www.blogcityusa.com/blog/desa/852/ - http://www.blogcityusa.com/blog/desa/853/ - http://www.blogcityusa.com/blog/desa/854/ - http://www.blogcityusa.com/blog/desa/855/ - http://www.blogcityusa.com/blog/desa/856/ - http://www.blogcityusa.com/blog/desa/857/ - http://www.blogcityusa.com/blog/desa/858/ - http://www.blogcityusa.com/blog/desa/859/ - http://www.blogcityusa.com/blog/desa/860/ - http://www.blogcityusa.com/blog/desa/861/ - http://www.blogcityusa.com/blog/desa/862/ - http://www.blogcityusa.com/blog/desa/863/ - http://www.blogcityusa.com/blog/desa/864/ - http://www.blogcityusa.com/blog/desa/865/ - http://www.blogcityusa.com/blog/desa/866/ - http://www.blogcityusa.com/blog/desa/867/ - http://www.blogcityusa.com/blog/desa/868/ - http://www.blogcityusa.com/blog/desa/869/ - http://www.blogcityusa.com/blog/desa/870/ - http://www.blogcityusa.com/blog/desa/871/ - http://www.blogcityusa.com/blog/desa/872/ - http://www.blogcityusa.com/blog/desa/873/ - http://www.blogcityusa.com/blog/desa/874/ - http://www.blogcityusa.com/blog/desa/875/ - http://www.blogcityusa.com/blog/desa/876/ - http://www.blogcityusa.com/blog/desa/877/ - http://www.blogcityusa.com/blog/desa/878/ - http://www.blogcityusa.com/blog/desa/879/ - http://www.blogcityusa.com/blog/desa/880/ - http://www.blogcityusa.com/blog/desa/881/ - http://www.blogcityusa.com/blog/desa/882/ - http://www.blogcityusa.com/blog/desa/883/ - http://www.blogcityusa.com/blog/desa/884/ - http://www.blogcityusa.com/blog/desa/885/ - http://www.blogcityusa.com/blog/desa/886/

Posted by Bill Karwin (bkarwin) on 2006-11-13T15:10:27.000+0000

Changing fix version to unknown.

Posted by Bill Karwin (bkarwin) on 2006-11-13T15:26:54.000+0000

Changing fix version to 0.6.0.

Posted by Alexander Veremyev (alexander) on 2007-01-24T19:46:59.000+0000

Issue is resolved with current encoding management support and new utf-8 anslyzer.

Have you found an issue?

See the Overview section for more details.

Copyright

© 2006-2016 by Zend, a Rogue Wave Company. Made with by awesome contributors.

This website is built using zend-expressive and it runs on PHP 7.

Contacts