Issues

ZF-3380: Zend_Search_Lucene indexer hangs (time out). Possibly due to recursion.

Issue Type: Bug Created: 2008-06-04T02:12:38.000+0000 Last Updated: 2012-08-31T08:41:07.000+0000 Status: Open Fix version(s): Reporter: Patrick Vandenbrande (havik3008) Assignee: Alexander Veremyev (alexander) Tags: - Zend_Search_Lucene

  • After1.12.0
  • zf-crteam-padraic
  • zf-crteam-priority

Related issues: Attachments:

Description

We have written a crawler based on code we found on the net. The code has run smoothly, however on numerous occassions it now hangs on certain pages - which have been changed and I guess it's the content of the pages that influences the behaviour. The script dies when the php max exec time is reached. An example page with which the statement "$index->addDocument($doc);" hangs is: http://www.febe.be/nl_BE/page/show/id/41

Please note that it seems most likely that the issue is not caused by the content of the page, because all other pages from that point onwards fail.

<pre class="highlight">
/**
 * SE siteCrawler v0.1
 * 
 * Generates a Lucene index per supported language
 * 
 * Uses:
 * - Zend_HTTP
 * - Zend_Search_Lucene
 * 
 * Todo:
 * - Currently deletes and rebuilds, should be dynamic
 * - Index pdf files & other
 * - Check for redirects
 * 
 * @author patrick @ studioemma.com
 */

header('Content-Type: text/html; charset=utf-8');
include("prepend.php");
set_time_limit(3600);

// Update these constants for correct usage.
define("_crawler_pathToLucenIndex",$application->getPath("Lucene"));
define("_crawler_indexName","FEBE");

require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Http/Client.php';
require_once 'Zend/Log.php';
require_once 'Zend/Log/Writer/Stream.php';

foreach ($application->getLanguages() as $supportedLanguage) {

    
    $start_uri = $application->getUrl("Absolute").$supportedLanguage."/index/index2";
    $match_uri = $application->getUrl("Absolute").$supportedLanguage."/";
    
    $logFile = _crawler_pathToLucenIndex._crawler_indexName.".".$supportedLanguage.'.log';
    $indexFile = _crawler_pathToLucenIndex._crawler_indexName.".".$supportedLanguage.'.index';
    
    // Should be rebuilt, but currently we remove the index
    system("rm -rf $indexFile");
    system("rm -rf $logFile");
//die($start_uri);
    
    // Set up log
    //$log = new Zend_Log(new Zend_Log_Writer_Stream($logFile));
    ob_implicit_flush(true);
    $log = new Zend_Log(new Zend_Log_Writer_Stream('<a>php://output</a>'));
    $log->info('Crawler starting up');

    // Set up Zend_Http_Client
    $client = new Zend_Http_Client();
    $client->setConfig(array('timeout' => 30));


    // Open a Lucene index, or create it if it does not exist
    // Make it possible to create more than one index:
    $indexpath = $indexFile;

    try {
        $index = Zend_Search_Lucene::open($indexpath);
        $log->info("Opened existing index in $indexpath");
    // If can't open, try creating
    } catch (Zend_Search_Lucene_Exception $e) {
        try {
            $index = Zend_Search_Lucene::create($indexpath);
            $log->info("Created new index in $indexpath");
        // If both fail, give up and show error message
        } catch(Zend_Search_Lucene_Exception $e) {
            $log->err("Failed opening or creating index in $indexpath");
            $log->err($e->getMessage());
            echo "Unable to open or create index: {$e->getMessage()}";
            exit(1);
        }
    }

    // Set up the targets array
    $targets = array($start_uri);
    // Start iterating
    for($i = 0; $i < count($targets); $i++) {
        // Fetch content with HTTP Client
        $client->setUri($targets[$i]);
        $response = $client->request();
        if ($response->isSuccessful()) {

            // Possibly check for redirects, don't know if this is currently done automaticaly

            $body = $response->getBody();
            $log->info("Fetched " . strlen($body) . " bytes from {$targets[$i]}");

            // Create document
            $body_checksum = md5($body);
            try {
                $doc = Zend_Search_Lucene_Document_Html::loadHTML($body);
            }
            catch (Exception $e) {
                // It's possibly another document
                echo "


Another document

";
            }

            $docMeta = array("description"=>"");
            $docMeta = @get_meta_tags($targets[$i]);

            // Add Fields to the IndexDocument
            $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $targets[$i]));
            $doc->addField(Zend_Search_Lucene_Field::UnIndexed('md5', $body_checksum));
            //$doc->addField(Zend_Search_Lucene_Field::UnIndexed('description', $docMeta["description"]));
                $log->info("Start indexing {$targets[$i]}");
                // Index the IndexDocument
                $index->addDocument($doc);
        
            $log->info("Indexed {$targets[$i]}");

            // Fetch new links
            $links = $doc->getLinks();
            foreach ($links as $link) {
                // Resolve relative links:
                if (strpos($link, "http://") === false) {
                    $link = $application->getUrl("Absolute") . substr($link,1);
                }
                // Add the link if applicable
                if ((strpos($link, $match_uri) !== false) && (! in_array($link, $targets))) {
                    $targets[] = $link;
                }
            }
        } else {
            $log->warn("Requesting {$targets[$i]} returned HTTP " . $response->getStatus());
        }
    }
    $log->info("Iterated over " . count($targets) . " documents");
    $log->info("Optimizing index...");
    $index->optimize();
    $index->commit();
    $log->info("Done. Index now contains " . $index->numDocs() . " documents");
    $log->info("Crawler shutting down");
}

Comments

Posted by Patrick Vandenbrande (havik3008) on 2008-06-04T03:12:15.000+0000

PHP is 5.2.3, memory limit set to 100M

Posted by Patrick Vandenbrande (havik3008) on 2008-06-04T13:55:16.000+0000

Apparently this issue is caused when overloading string functions. Most likely due to using str_len to count the bytes.

So the issue can be logged to review in later revisions.

Posted by Eric (e) on 2009-06-19T05:33:18.000+0000

We resolved a similar problem with commenting out ;mbstring.internal_encoding = UTF-8 ;mbstring.func_overload = 7 in php.ini.

We traced the problem to the following code in Zend_Search_Lucene_Index_SegmentWriter -> _generateCFS()
while ($byteCount > 0) { $data = $dataFile->readBytes(min($byteCount, 131072 /*128Kb*/)); $byteCount -= strlen($data); $cfsFile->writeBytes($data); }

We would really like this problem to be resolved soon as we are planning to support UTF-8 in our application.

Posted by Frédéric Choquet (fchoquet) on 2009-06-26T08:42:25.000+0000

I met exactly the same problem as Eric.

We have several servers running zend lucene and only one failed de create index. I found out that mbstring.func_overload was accidentaly activated on this server.

The symptom was an infinite loop in Zend_Search_Lucene_Index_SegmentWriter::_generateCFS

readBytes did not return the right value. I had 5 bytes missing.

Maybe it's not actually a bug (mbstring.func_overload is a really weird option that prevents binary file handling) but you should prevent the execution to get into an infinite loop and raise a "mbstring.func_overload not supported" exception.

Posted by michal kralik (ceecko) on 2010-01-31T06:57:31.000+0000

I still experience the same issue in ZF 1.10 I managed to solve it by disabling mbstring.func_overload. That however prevented my app working with utf-8, which is not an option :(

Posted by Tomek Pęszor (admirau) on 2010-10-27T07:08:37.000+0000

How to reproduce the bug:

Add this at the end of php.ini:

mbstring.func_overload = 7

Then during building the search index, script hangs in an infinite loop at:

Zend_Search_Lucene_Index_SegmentWriter::_generateCFS()

while ($byteCount > 0) { $data = $dataFile->readBytes(min($byteCount, 131072 /128Kb/)); $byteCount -= strlen($data); // here the length is always 1 $cfsFile->writeBytes($data); }

The indexed data contains UTF-8 characters, such as: ąę×ć...

(the bug still exists in ZF 1.11)

Posted by Adam Lundrigan (adamlundrigan) on 2012-02-24T12:22:50.000+0000

@Tomek: Would using www.php.net/manual/en/function.mb-strlen.php" rel="nofollow">mb_strlen fix this issue there?

Posted by Adam Lundrigan (adamlundrigan) on 2012-05-30T13:10:15.000+0000

Is there any possible fix here? It seems that the indexer needs mbstring in order to properly index UTF-8 characters, but turning it on (via func_overload) causes an infinite loop.

Have you found an issue?

See the Overview section for more details.

Copyright

© 2006-2016 by Zend, a Rogue Wave Company. Made with by awesome contributors.

This website is built using zend-expressive and it runs on PHP 7.

Contacts