Issues

ZF-11717: Corrupt index files with Zend_Search_Lucene

Description

After several attempts to index data coming from a DB I always obtain the following Exception:

Fatal error: Uncaught exception 'Zend_Search_Lucene_Exception' with message 'Error occured while file reading.' in /usr/local/ZendFramework-1.11.10/library/Zend/Search/Lucene/Storage/File/Filesystem.php:174 Stack trace:

0 /usr/local/ZendFramework-1.11.10/library/Zend/Search/Lucene/Storage/File.php(471): Zend_Search_Lucene_Storage_File_Filesystem->_fread(33554431)

1 /usr/local/ZendFramework-1.11.10/library/Zend/Search/Lucene/Index/SegmentMerger.php(203): Zend_Search_Lucene_Storage_File->readBinary()

2 /usr/local/ZendFramework-1.11.10/library/Zend/Search/Lucene/Index/SegmentMerger.php(126): Zend_Search_Lucene_Index_SegmentMerger->_mergeStoredFields()

3 /usr/local/ZendFramework-1.11.10/library/Zend/Search/Lucene/Index/Writer.php(385): Zend_Search_Lucene_Index_SegmentMerger->merge()

4 /usr/local/ZendFramework-1.11.10/library/Zend/Search/Lucene/Index/Writer.php(341): Zend_Search_Lucene_Index_Writer->_mergeSegments(Array)

5 /usr/local/ZendFramework-1.11.10/library/Zend/Search/Lucene/Index/Writer.php(250): Zend_Search_Luce in /usr/local/ZendFramework-1.11.10/library/Zend/Search/Lucene/Storage/File/Filesystem.php on line 174

Notice the the param in the _fread call causing the exception is always 33554431 (which I believe is the data in the index file causing the corruption when merging). In the tries the exception is not always in the same set of data (as the data is changing) but is very likely that it happens more or less after the same amount of data sets are read from DB.

The indexing is made by a separate program and is not concurrent (i.e: is the only process writing the index and there is no other neither writing nor reading from index files).

When trying to optimize the index the following is obtained:

Fatal error: Uncaught exception 'Zend_Search_Lucene_Exception' with message 'Error occured while file reading.' in /usr/local/ZendFramework-1.11.10/library/Zend/Search/Lucene/Storage/File/Filesystem.php:174 Stack trace:

0 /usr/local/ZendFramework-1.11.10/library/Zend/Search/Lucene/Storage/File.php(471): Zend_Search_Lucene_Storage_File_Filesystem->_fread(33554431)

1 /usr/local/ZendFramework-1.11.10/library/Zend/Search/Lucene/Index/SegmentMerger.php(203): Zend_Search_Lucene_Storage_File->readBinary()

2 /usr/local/ZendFramework-1.11.10/library/Zend/Search/Lucene/Index/SegmentMerger.php(126): Zend_Search_Lucene_Index_SegmentMerger->_mergeStoredFields()

3 /usr/local/ZendFramework-1.11.10/library/Zend/Search/Lucene/Index/Writer.php(385): Zend_Search_Lucene_Index_SegmentMerger->merge()

4 /usr/local/ZendFramework-1.11.10/library/Zend/Search/Lucene/Index/Writer.php(807): Zend_Search_Lucene_Index_Writer->_mergeSegments(Array)

5 /usr/local/ZendFramework-1.11.10/library/Zend/Search/Lucene.php(1456): Zend_Search_Lucene_Index_Wri in /usr/local/ZendFramework-1.11.10/library/Zend/Search/Lucene/Storage/File/Filesystem.php on line 174

I've downloaded lucene-2.9.4 and run Check.Index on the index file and while it recovers the index (by deleting several hundreds of docuemnts) the resulting index files are in a format not (yet, I hope) understood by Zend Framework.

My Zend Framework is 1.11.10.

My PHP is 5.3.6 with the following modules compiled in:

[PHP Modules] Core ctype curl date dom ereg fileinfo filter gd gettext hash iconv json libxml mbstring mhash mysqli openssl pcntl pcre PDO pdo_mysql pdo_sqlite Phar posix Reflection rrdtool session SimpleXML sockets SPL SQLite sqlite3 standard tidy tokenizer xml xmlreader xmlwriter zip zlib

Finally, I found that at least 1 document what misidentified as being UTF-8 encoded but really it was ISO-8859-1. All documents were indexed as being UTF-8. Enforcing a hard encoding identification seems to have solved the problem. As documents were downloaded from the Internet it is unwise to trust given document encoding.

Whereas programmer should assure the documents are properly encoded I do really believe that some kind of check/fix index files should be provided inside the framework as relying in other tools (i.e: Lucene core pack) could recover the corrupted index files but leave them in a format not (yet) handled by the Framework.

Comments

No comments to display