ZF-2857: Zend_Search_Lucene_Query::highlightMatches doesn't take the analyzer's encoding into account


Zend_Search_Lucene_Search_Query::highlightMatches doesn't ensure that the text to be highlighted has the same encoding as the one the analyzer uses to extract the tokens. The token offsets are therefore computed against a different string than the one being split, which ultimately breaks the highlighting.

To reproduce, pass a multibyte string to the method while using the default analyzer. The issue can be worked around by manually converting the text to ASCII//TRANSLIT before invoking highlightMatches.
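
A minimal sketch of that workaround, assuming the input is UTF-8 and that the local iconv supports //TRANSLIT. The helper name toAnalyzerEncoding is hypothetical and not part of Zend Framework, and the exact transliteration result depends on the iconv implementation and locale:

```php
<?php
/**
 * Convert text to the ASCII form the default analyzer works on.
 * Hypothetical helper; $sourceEncoding is an assumption, pass whatever
 * encoding the input really uses.
 */
function toAnalyzerEncoding(string $text, string $sourceEncoding = 'UTF-8'): string
{
    // Same conversion target the default analyzer uses internally.
    $converted = @iconv($sourceEncoding, 'ASCII//TRANSLIT', $text);
    if ($converted === false) {
        throw new RuntimeException('iconv could not convert the input');
    }
    return $converted;
}

// Usage with Zend Framework loaded ($query and $html are assumed to exist):
// $highlighted = $query->highlightMatches(toAnalyzerEncoding($html));
```

Note that this also transliterates the text that ends up in the highlighted output, so accented characters are lost in the result.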

Currently the code in Zend_Search_Lucene_Document_Html::_highlightTextNode looks like this:

$analyzer = Zend_Search_Lucene_Analysis_Analyzer::getDefault();
$analyzer->setInput($node->nodeValue, $this->_doc->encoding); //converts from _doc->encoding to ASCII//TRANSLIT
foreach ($matchedTokens as $token) {
    // Cut text after matched token
    $node->splitText($token->getEndOffset()); //uses wrong character offset
    // ...
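
The offset drift can be illustrated without Zend Framework. The sketch below uses a hand-written replacement table as a stand-in for ASCII//TRANSLIT (an assumption, since real iconv output varies by implementation and locale) and compares the end offset of a token in the transliterated string with the character offset of the same token in the original text. The function names are hypothetical:

```php
<?php
// Hypothetical stand-in for iconv(..., 'ASCII//TRANSLIT', ...).
function transliterate(string $text): string
{
    return strtr($text, ['ü' => 'ue', '€' => 'EUR']);
}

/**
 * Return [end offset in the analyzed string, end offset in the original],
 * the latter counted in characters as splitText() expects.
 */
function endOffsets(string $original, string $needle): array
{
    $analyzed = transliterate($original);

    // Offset the analyzer would report (ASCII, so bytes == characters).
    $inAnalyzed = strpos($analyzed, $needle) + strlen($needle);

    // Offset that is actually valid for the original UTF-8 text.
    $inOriginal = iconv_strpos($original, $needle, 0, 'UTF-8')
                + iconv_strlen($needle, 'UTF-8');

    return [$inAnalyzed, $inOriginal];
}

list($reported, $needed) = endOffsets('Müller €5 search', 'search');
echo "analyzer reports $reported, original needs $needed\n";
// prints "analyzer reports 19, original needs 16"
```

With this table the reported offset is three characters past the position that splitText would need, so the split lands in the wrong place or beyond the end of the node.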

I suggest providing a method in the analyzer that converts any text to its internal encoding, and invoking it before creating the Zend_Search_Lucene_Document_Html instance, like this (in Zend_Search_Lucene_Search_Query::highlightMatches):

$input = Zend_Search_Lucene_Analysis_Analyzer::getDefault()->encode($inputHTML);
$doc = Zend_Search_Lucene_Document_Html::loadHTML($input);


Please categorize/fix as needed.

