ZF-3629: Lucene result highlighting
Description
Hello,
i have a problem with the lucene result highlighting. The highlighted words or even fragments were nothing i've searched for.
It seems to me that the dom-function TextNode->splitText is not multibyte-save. Is that right?
I make this assumption because when i do this:
echo mb_substr($node->wholeText,$token->getStartOffset(),$token->getEndOffset()-$token->getStartOffset(),'UTF-8');
just before the splitting in (Document/Html.php) i get the correct text, but the main splitting is made somewhere else in the text.
Comments
Posted by Gerard van Helden (drm) on 2009-11-11T13:20:07.000+0000
Has to do with http://bugs.php.net/bug.php?id=46335
I have written a workaround for it, but it depends on mb_substr(). I'm not sure if it's ok for lucene to depend on multibyte. If not a problem, I'll submit the patch.
Btw, I have only tested this on a debian lenny with php 5.2.6, so I'm not entirely sure the bug still exists in PHP at all. If not, this issue could be resolved anyway.
Posted by Stéphane (stephane) on 2009-11-20T06:29:56.000+0000
Just a quick comment to let you know that a pretty bad workaround is just to create an instance of the parser manually:
// Builds a list of words from the search terms $words = explode(' ', mb_strtolower($search)); // Initializes the Lucene parser and tokenizes the excerpt $analyzer = new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive(); $analyzer->setInput($content, "utf-8"); // Initializes a list of tokens for the search terms $tokens = array(); // Retrieves the first token from the excerpt $token = $analyzer->nextToken(); // Parses each token and processes only the ones corresponding to the // search terms while ($token !== null) { // Stores any token found matching a search term if (array_value_exists($token->getTermText(), $words)) { $tokens[] = $token; } $token = $analyzer->nextToken(); } // Defines a padding to be able to compute the real position of each // token whenever some 'strong' tags is added to the excerpt $pad = 0; foreach ($tokens as $token) { // Retrieves the position of the search term thanks to the token $start = $token->getStartOffset() + $pad; $end = $token->getEndOffset() + $pad; // Highlights the search term $content = mb_substr($content, 0, $start) . "" . mb_substr($content, $start, $end - $start) . "" . mb_substr($content, $end); // Increases the padding $pad += mb_strlen(""); }Posted by Gerard van Helden (drm) on 2009-11-24T04:46:40.000+0000
My fix can be implemented with iconv too. Since that is an essential dependency of Zend_Search_Lucene anyway, I will submit a patch for it asap.
Posted by Nicolas Huguet (nicolas.huguet) on 2010-09-17T08:08:57.000+0000
I have the same problem. The following changes to the file Zend/Search/Lucene/Document/Html.php seems to fix this problem in my case :
--- old 2010-09-17 17:02:04.000000000 +0200 +++ new 2010-09-17 17:03:43.000000000 +0200 @@ -306,11 +306,20 @@ $matchedTokens = array_reverse($matchedTokens); foreach ($matchedTokens as $token) { + + // Convert characters offset to bytes offsets + // Because splitText() bellow does not handle multibytes characters + // see http://bugs.php.net/bug.php?id=46335 + $startOffset = $token->getStartOffset(); + $endOffset = $token->getEndOffset(); + $bytesStartOffset = strlen(mb_substr($node->wholeText, 0, $startOffset, 'UTF-8')); + $bytesEndOffset = strlen(mb_substr($node->wholeText, 0, $endOffset, 'UTF-8')); + // Cut text after matched token - $node->splitText($token->getEndOffset()); + $node->splitText($bytesEndOffset); // Cut matched node - $matchedWordNode = $node->splitText($token->getStartOffset()); + $matchedWordNode = $node->splitText($bytesStartOffset); // Retrieve HTML string representation for highlihted word $fullCallbackparamsList = $params;