Issues

ZF-11009: Highlighting does not work as expected (accented characters)

Description

The query appears to be tokenized twice. The first time is in class Zend_Search_Lucene_Search_Query_Preprocessing_Term on line 299:


$tokens = Zend_Search_Lucene_Analysis_Analyzer::getDefault()->tokenize($this->_word, $this->_encoding);

At this point everything is fine: $this->_encoding is 'utf-8'. However, there is a second ->tokenize() call in Zend_Search_Lucene_Document_Html on line 427 (inside a foreach):


$wordsToHighlightList[] = $analyzer->tokenize($wordString);

but no encoding is passed to the analyzer, which is obtained on line 424:


$analyzer = Zend_Search_Lucene_Analysis_Analyzer::getDefault();

... and, unlike the first tokenize() call, this one returns a corrupted string, so nothing is highlighted. Inside the analyzer the encoding appears as an empty string. When I change line 427 to:


$wordsToHighlightList[] = $analyzer->tokenize($wordString, 'utf-8');

... everything works fine.
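
The kind of corruption described above can be reproduced outside the library. The following is only a minimal sketch, assuming that an empty encoding causes the input bytes to be treated as some non-UTF-8 charset and converted to UTF-8 again; the report does not confirm which charset is actually used, so ISO-8859-1 here is purely illustrative, and the variable names are hypothetical.

```php
<?php
// "jaźń" written as explicit UTF-8 bytes, so this file's own encoding does not matter.
$word = "ja\xC5\xBA\xC5\x84";

// Correct handling: the bytes are already UTF-8, so they are used as-is.
$asUtf8 = $word;

// Assumed failure mode: the same bytes are read as a single-byte charset
// (ISO-8859-1 is only an example) and converted to UTF-8 a second time.
$misread = iconv('ISO-8859-1', 'UTF-8', $word);

// Compare the character counts and check whether the string survived intact.
printf("correct: %d characters\n", iconv_strlen($asUtf8, 'UTF-8'));
printf("misread: %d characters\n", iconv_strlen($misread, 'UTF-8'));
var_dump($misread === $word);
```

If the tokens built for highlighting come from a string mangled like $misread, they can no longer match the accented words in the document text, which would be consistent with the missing highlight.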

Code:


$query = "jaźń";

// Open the existing index, or create it if it does not exist yet.
try {
    $index = Zend_Search_Lucene::open($this->getOption('indexDirectory'));
} catch (Zend_Search_Lucene_Exception $e) {
    $index = Zend_Search_Lucene::create($this->getOption('indexDirectory'));
}

Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
);

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::text('title', 'żółć gęślą jaźń ąćź', 'utf-8'));
$index->addDocument($doc);
$index->commit();

$query = Zend_Search_Lucene_Search_QueryParser::parse($query);

try {
    $results = $index->find($query);
    foreach ($results as $result) {
        echo $query->highlightMatches($result->title, 'utf-8') . "\n";
    }
} catch (Exception $e) {
}
