Issues

ZF-3380: Zend_Search_Lucene indexer hangs (time out). Possibly due to recursion.

Description

We have written a crawler based on code we found on the net. The code used to run smoothly, but it now frequently hangs on certain pages. Those pages have been changed, so I guess their content influences the behaviour. The script dies when PHP's max execution time is reached. An example page on which the statement "$index->addDocument($doc);" hangs is: http://www.febe.be/nl_BE/page/show/id/41

Note, however, that the content of that page is most likely not the cause, because every other page from that point onwards fails as well.


/**
 * SE siteCrawler v0.1
 * 
 * Generates a Lucene index per supported language
 * 
 * Uses:
 * - Zend_HTTP
 * - Zend_Search_Lucene
 * 
 * Todo:
 * - Currently deletes and rebuilds, should be dynamic
 * - Index pdf files & other
 * - Check for redirects
 * 
 * @author patrick @ studioemma.com
 */

header('Content-Type: text/html; charset=utf-8');
include("prepend.php");
set_time_limit(3600);

// Update these constants for correct usage.
define("_crawler_pathToLucenIndex",$application->getPath("Lucene"));
define("_crawler_indexName","FEBE");

require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Http/Client.php';
require_once 'Zend/Log.php';
require_once 'Zend/Log/Writer/Stream.php';

foreach ($application->getLanguages() as $supportedLanguage) {

    
    $start_uri = $application->getUrl("Absolute").$supportedLanguage."/index/index2";
    $match_uri = $application->getUrl("Absolute").$supportedLanguage."/";
    
    $logFile = _crawler_pathToLucenIndex._crawler_indexName.".".$supportedLanguage.'.log';
    $indexFile = _crawler_pathToLucenIndex._crawler_indexName.".".$supportedLanguage.'.index';
    
    // Should be rebuilt, but currently we remove the index
    system("rm -rf $indexFile");
    system("rm -rf $logFile");
//die($start_uri);
    
    // Set up log
    //$log = new Zend_Log(new Zend_Log_Writer_Stream($logFile));
    ob_implicit_flush(true);
    $log = new Zend_Log(new Zend_Log_Writer_Stream('php://output'));
    $log->info('Crawler starting up');

    // Set up Zend_Http_Client
    $client = new Zend_Http_Client();
    $client->setConfig(array('timeout' => 30));


    // Open a Lucene index, or create it if it does not exist
    // Make it possible to create more than one index:
    $indexpath = $indexFile;

    try {
        $index = Zend_Search_Lucene::open($indexpath);
        $log->info("Opened existing index in $indexpath");
    // If can't open, try creating
    } catch (Zend_Search_Lucene_Exception $e) {
        try {
            $index = Zend_Search_Lucene::create($indexpath);
            $log->info("Created new index in $indexpath");
        // If both fail, give up and show error message
        } catch(Zend_Search_Lucene_Exception $e) {
            $log->err("Failed opening or creating index in $indexpath");
            $log->err($e->getMessage());
            echo "Unable to open or create index: {$e->getMessage()}";
            exit(1);
        }
    }

    // Set up the targets array
    $targets = array($start_uri);
    // Start iterating
    for($i = 0; $i < count($targets); $i++) {
        // Fetch content with HTTP Client
        $client->setUri($targets[$i]);
        $response = $client->request();
        if ($response->isSuccessful()) {

            // Possibly check for redirects, don't know if this is currently done automatically

            $body = $response->getBody();
            $log->info("Fetched " . strlen($body) . " bytes from {$targets[$i]}");

            // Create document
            $body_checksum = md5($body);
            try {
                $doc = Zend_Search_Lucene_Document_Html::loadHTML($body);
            }
            catch (Exception $e) {
                // It's possibly another document
                echo "\nAnother document\n";
            }
            $docMeta = array("description"=>"");
            $docMeta = @get_meta_tags($targets[$i]);

            // Add Fields to the IndexDocument
            $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $targets[$i]));
            $doc->addField(Zend_Search_Lucene_Field::UnIndexed('md5', $body_checksum));
            //$doc->addField(Zend_Search_Lucene_Field::UnIndexed('description', $docMeta["description"]));

            $log->info("Start indexing {$targets[$i]}");
            // Index the IndexDocument
            $index->addDocument($doc);
            $log->info("Indexed {$targets[$i]}");

            // Fetch new links
            $links = $doc->getLinks();
            foreach ($links as $link) {
                // Resolve relative links:
                if (strpos($link, "http://") === false) {
                    $link = $application->getUrl("Absolute") . substr($link,1);
                }
                // Add the link if applicable
                if ((strpos($link, $match_uri) !== false) && (! in_array($link, $targets))) {
                    $targets[] = $link;
                }
            }
        } else {
            $log->warn("Requesting {$targets[$i]} returned HTTP " . $response->getStatus());
        }
    }

    $log->info("Iterated over " . count($targets) . " documents");
    $log->info("Optimizing index...");
    $index->optimize();
    $index->commit();
    $log->info("Done. Index now contains " . $index->numDocs() . " documents");
    $log->info("Crawler shutting down");
}

Comments

PHP is 5.2.3, memory limit set to 100M

Apparently this issue occurs when the string functions are overloaded (mbstring.func_overload), most likely because strlen is used to count the bytes.

So the issue can be logged to review in later revisions.

We resolved a similar problem by commenting out the following settings in php.ini:

;mbstring.internal_encoding = UTF-8
;mbstring.func_overload = 7

We traced the problem to the following code in Zend_Search_Lucene_Index_SegmentWriter -> _generateCFS()

while ($byteCount > 0) {
    $data = $dataFile->readBytes(min($byteCount, 131072 /*128Kb*/));
    $byteCount -= strlen($data);
    $cfsFile->writeBytes($data);
}
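
One plausible reading of why this loop never ends, sketched below under assumptions: the read returns raw bytes, while an overloaded strlen counts characters, so $byteCount shrinks too slowly and is still positive once the source is exhausted. The names simulateCopy and charLength are hypothetical, a php://memory stream stands in for $dataFile, and the iteration cap exists only so the demonstration terminates. It is meant to be run without overloading enabled.

```php
<?php
// Stand-in for what an overloaded strlen() would report on UTF-8 data:
// characters instead of bytes (every byte that is not a continuation byte).
function charLength($s)
{
    return strlen($s) - preg_match_all('/[\x80-\xBF]/', $s);
}

// Hypothetical re-creation of the loop's bookkeeping with a pluggable
// length function. Returns array(bytes still expected, iterations used).
function simulateCopy($payload, $lengthFn, $maxIterations = 10)
{
    $in = fopen('php://memory', 'w+b');
    fwrite($in, $payload);
    rewind($in);

    $byteCount = strlen($payload); // real size in bytes
    $iterations = 0;
    while ($byteCount > 0 && $iterations < $maxIterations) {
        $data = (string) fread($in, min($byteCount, 131072 /*128Kb*/));
        $byteCount -= call_user_func($lengthFn, $data);
        $iterations++;
    }
    fclose($in);

    return array($byteCount, $iterations);
}

// "ąę×ć" written as explicit UTF-8 bytes: 4 characters, 8 bytes.
$payload = "\xC4\x85\xC4\x99\xC3\x97\xC4\x87";

list($leftBytes, $itersBytes) = simulateCopy($payload, 'strlen');
list($leftChars, $itersChars) = simulateCopy($payload, 'charLength');

echo "byte counting: $leftBytes left after $itersBytes iteration(s)\n";
echo "char counting: $leftChars left after $itersChars iteration(s)\n";
```

With byte counting the loop finishes in one pass. With character counting it is still waiting for bytes when the cap is reached, which is the hang seen without a cap.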

We would really like this problem to be resolved soon as we are planning to support UTF-8 in our application.

I met exactly the same problem as Eric.

We have several servers running Zend Lucene and only one failed to create the index. I found out that mbstring.func_overload was accidentally activated on this server.

The symptom was an infinite loop in Zend_Search_Lucene_Index_SegmentWriter::_generateCFS

readBytes did not return the right value. I had 5 bytes missing.

Maybe it's not actually a bug (mbstring.func_overload is a really weird option that prevents binary file handling), but you should prevent execution from getting into an infinite loop and raise an "mbstring.func_overload not supported" exception instead.
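
A minimal sketch of such a guard, not taken from the framework: isStringOverloadActive and assertBinarySafeStrings are hypothetical names. It relies on bit 2 of mbstring.func_overload being the one that covers the string functions, and on ini_get() returning false when the directive does not exist (it was removed in PHP 8.0).

```php
<?php
// True when the given mbstring.func_overload value overloads the string
// functions (strlen, substr, ...). Bit 1 is mail(), bit 4 is the regex functions.
function isStringOverloadActive($iniValue)
{
    return ((int) $iniValue & 2) === 2;
}

// Hypothetical guard to call once before any binary file handling starts.
function assertBinarySafeStrings()
{
    // ini_get() yields false if the directive is unknown, which casts to 0.
    if (isStringOverloadActive(ini_get('mbstring.func_overload'))) {
        throw new RuntimeException(
            'mbstring.func_overload for string functions is not supported'
        );
    }
}
```

Calling assertBinarySafeStrings() before building the index would turn the silent hang into an immediate, readable error.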

I still experience the same issue in ZF 1.10. I managed to solve it by disabling mbstring.func_overload. That, however, prevented my app from working with UTF-8, which is not an option :(

How to reproduce the bug:

Add this at the end of php.ini:

mbstring.func_overload = 7

Then during building the search index, script hangs in an infinite loop at:

Zend_Search_Lucene_Index_SegmentWriter::_generateCFS()

while ($byteCount > 0) {
    $data = $dataFile->readBytes(min($byteCount, 131072 /*128Kb*/));
    $byteCount -= strlen($data); // here the length is always 1
    $cfsFile->writeBytes($data);
}

The indexed data contains UTF-8 characters, such as: ąę×ć...

(the bug still exists in ZF 1.11)

@Tomek: Would using mb_strlen (www.php.net/manual/en/function.mb-strlen.php) fix this issue there?

Is there any possible fix here? It seems that the indexer needs mbstring in order to properly index UTF-8 characters, but turning it on (via func_overload) causes an infinite loop.
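
One possible direction, offered as a sketch rather than as the framework's actual fix: count bytes with mb_strlen($data, '8bit'), which reports bytes whether or not overloading is active, and stop with an exception when a read returns nothing. The names byteLength and copyBytes are hypothetical, and plain PHP streams stand in for $dataFile and $cfsFile.

```php
<?php
// Byte length that does not depend on mbstring.func_overload:
// the '8bit' encoding makes mb_strlen() count bytes, not characters.
function byteLength($data)
{
    return function_exists('mb_strlen') ? mb_strlen($data, '8bit') : strlen($data);
}

// Copies $byteCount bytes from stream $in to stream $out and returns the
// number of bytes copied. Fails loudly instead of spinning forever.
function copyBytes($in, $out, $byteCount)
{
    $copied = 0;
    while ($byteCount > 0) {
        $data = fread($in, min($byteCount, 131072 /*128Kb*/));
        if ($data === false || $data === '') {
            throw new RuntimeException(
                "Source ended with $byteCount bytes still expected"
            );
        }
        $n = byteLength($data);
        $byteCount -= $n;
        $copied += $n;
        fwrite($out, $data);
    }

    return $copied;
}
```

Whether the rest of the indexer is safe under overloading is a separate question; this only covers the byte accounting in a loop of this shape.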