Issues

ZF-3629: Lucene result highlighting

Description

Hello,

i have a problem with the lucene result highlighting. The highlighted words or even fragments were nothing i've searched for.

It seems to me that the dom-function TextNode->splitText is not multibyte-save. Is that right?

I make this assumption because when i do this:

echo mb_substr($node->wholeText,$token->getStartOffset(),$token->getEndOffset()-$token->getStartOffset(),'UTF-8');

just before the splitting in (Document/Html.php) i get the correct text, but the main splitting is made somewhere else in the text.

Comments

Has to do with http://bugs.php.net/bug.php?id=46335

I have written a workaround for it, but it depends on mb_substr(). I'm not sure if it's ok for lucene to depend on multibyte. If not a problem, I'll submit the patch.

Btw, I have only tested this on a debian lenny with php 5.2.6, so I'm not entirely sure the bug still exists in PHP at all. If not, this issue could be resolved anyway.

Just a quick comment to let you know that a pretty bad workaround is just to create an instance of the parser manually:

 
// Builds a list of words from the search terms
$words = explode(' ', mb_strtolower($search));

// Initializes the Lucene parser and tokenizes the excerpt
$analyzer = new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive();
$analyzer->setInput($content, "utf-8");

// Initializes a list of tokens for the search terms
$tokens = array();

// Retrieves the first token from the excerpt
$token = $analyzer->nextToken();

// Parses each token and processes only the ones corresponding to the
// search terms
while ($token !== null) {
  // Stores any token found matching a search term
  if (array_value_exists($token->getTermText(), $words)) {
    $tokens[] = $token;
  }
  
  $token = $analyzer->nextToken();
}

// Defines a padding to be able to compute the real position of each 
// token whenever some 'strong' tags is added to the excerpt
$pad = 0;

foreach ($tokens as $token) {
  // Retrieves the position of the search term thanks to the token
  $start = $token->getStartOffset() + $pad;
  $end = $token->getEndOffset() + $pad;
  
  // Highlights the search term
  $content = mb_substr($content, 0, $start) . "" . mb_substr($content, $start, $end - $start) . "" . mb_substr($content, $end);
  
  // Increases the padding
  $pad += mb_strlen(""); 
}

My fix can be implemented with iconv too. Since that is an essential dependency of Zend_Search_Lucene anyway, I will submit a patch for it asap.

I have the same problem. The following changes to the file Zend/Search/Lucene/Document/Html.php seems to fix this problem in my case :

 

--- old 2010-09-17 17:02:04.000000000 +0200
+++ new 2010-09-17 17:03:43.000000000 +0200
@@ -306,11 +306,20 @@
         $matchedTokens = array_reverse($matchedTokens);

         foreach ($matchedTokens as $token) {
+
+            // Convert characters offset to bytes offsets
+            // Because splitText() bellow does not handle multibytes characters
+            // see http://bugs.php.net/bug.php?id=46335
+            $startOffset = $token->getStartOffset();
+            $endOffset = $token->getEndOffset();
+            $bytesStartOffset = strlen(mb_substr($node->wholeText, 0, $startOffset, 'UTF-8'));
+            $bytesEndOffset = strlen(mb_substr($node->wholeText, 0, $endOffset, 'UTF-8'));
+
             // Cut text after matched token
-            $node->splitText($token->getEndOffset());
+            $node->splitText($bytesEndOffset);

             // Cut matched node
-            $matchedWordNode = $node->splitText($token->getStartOffset());
+            $matchedWordNode = $node->splitText($bytesStartOffset);

             // Retrieve HTML string representation for highlihted word
             $fullCallbackparamsList = $params;