Issues

ZF-3994: Lucene optimize() consumes 64M (or more) when optimizing indices created with stock settings in ZF 1.6.0 RC2

Description

Lucene optimize() consumes 64M (or more) when optimizing indices created with stock settings in ZF 1.6.0 RC2.

The data set was originally about 71,000 items in total, yielding an index of roughly 36 MB after indexing with stock Lucene settings. After setting memory_limit = 128M in /etc/php.ini and then re-running ZSL->optimize(), the resulting single index file ("_2eaw.cfs") is 31,798,340 bytes. If memory_limit = 64M in php.ini, then ZSL->optimize() bombs. The following two messages appeared (a reproduction sketch follows them):

The first message appeared when memory_limit = 64M, before it was increased to 128M (which allowed optimization to succeed):

{quote} Fatal error: Allowed memory size of 67108864 bytes exhausted (tried to allocate 1024 bytes) in /var/www/SHARED/ZEND_FRAMEWORK/ZendFramework-1.6.0RC2/library/Zend/Search/Lucene/Index/SegmentInfo.php on line 1591 {quote}

The second message appeared after optimize() succeeded (with memory_limit = 128M), another item was added to the index, and memory_limit was reset to 64M in php.ini, followed by an attempt to optimize() again:

{quote} Fatal error: Allowed memory size of 67108864 bytes exhausted (tried to allocate 71 bytes) in /var/www/SHARED/ZEND_FRAMEWORK/ZendFramework-1.6.0RC2/library/Zend/Search/Lucene/Index/SegmentWriter.php on line 453 {quote}
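
For context, the index in question would typically have been built and optimized with code along these lines (a minimal sketch with a placeholder index path, data array, and field names, not the reporter's actual script):

{code:php}
<?php
// Minimal sketch: build and then optimize a Zend_Search_Lucene index.
// The index path, $items array, and field names are placeholders.
require_once 'Zend/Search/Lucene.php'; // assumes ZF is on the include_path

$items = array(
    array('id' => 1, 'body' => 'sample text'),
    // ... ~71,000 rows in the reported case
);

$index = Zend_Search_Lucene::create('/path/to/index');

foreach ($items as $item) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Keyword('id', $item['id']));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $item['body']));
    $index->addDocument($doc);
}

$index->commit();
$index->optimize(); // the step that exhausts memory_limit = 64M
{code}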

Comments

This is actually a serious issue and affects all versions of ZF that include Lucene.

When you build a large index and then run the optimize() method, PHP throws an "Allowed memory size of ... bytes exhausted" fatal error. Since the PHP process has a fixed memory limit, and Zend_Search_Lucene needs to load the index data into memory to process it, the memory_limit appears to need to be roughly twice the index size (about a 2:1 ratio) to prevent memory exhaustion.
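
One stopgap, in line with what the reporter did, is to raise memory_limit only for the maintenance script that performs the optimization rather than globally in php.ini (a sketch; the 256M value and index path are arbitrary examples):

{code:php}
<?php
// Sketch: raise the limit for this maintenance script only; 256M and the
// index path are arbitrary examples, not recommended values.
require_once 'Zend/Search/Lucene.php';

ini_set('memory_limit', '256M'); // roughly 2x the index size, per the ratio above

$index = Zend_Search_Lucene::open('/path/to/index');
$index->optimize(); // merges all segments into a single .cfs file
{code}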

There is probably a better way to construct the optimize() method to handle optimizing the index files, but I do not have time at the moment to fix this for the community.

I'm not sure if anyone has found a workaround for the problem, but if you are working with a very large index, say 500MB+, I have found the easiest solution is to set up a separate web service running Apache Solr and use its DataImportHandler to build and maintain the indexes.
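
For completeness, once the data is indexed by Solr, PHP can search it with a plain HTTP request instead of Zend_Search_Lucene (a sketch; the host, port, core name, and query field are placeholders):

{code:php}
<?php
// Sketch: query a hypothetical Solr core over HTTP. Host, port, core name,
// and the 'contents' field are placeholders for whatever the Solr schema defines.
$query = urlencode('contents:framework');
$url   = "http://localhost:8983/solr/mycore/select?q={$query}&wt=json";

$response = json_decode(file_get_contents($url), true);

foreach ($response['response']['docs'] as $doc) {
    echo $doc['id'], "\n";
}
{code}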