ZF-11009: Highlighting does not work as expected (accented characters)
Description
The query gets tokenized twice (?). The first time is in class Zend_Search_Lucene_Search_Query_Preprocessing_Term, on line 299:
$tokens = Zend_Search_Lucene_Analysis_Analyzer::getDefault()->tokenize($this->_word, $this->_encoding);
At this point everything is fine: $this->_encoding = utf-8. But then there is a second ->tokenize() call, in Zend_Search_Lucene_Document_Html on line 427 (inside a foreach):
$wordsToHighlightList[] = $analyzer->tokenize($wordString);
but no encoding is passed to the analyzer, which is obtained on line 424:
$analyzer = Zend_Search_Lucene_Analysis_Analyzer::getDefault();
... and, unlike in the first tokenize call, I get a corrupted string and, as a result, no highlighting. Inside the analyzer the encoding shows up as an empty string. When I change line 427 to:
$wordsToHighlightList[] = $analyzer->tokenize($wordString, 'utf-8');
... everything works fine.
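To illustrate why a missing encoding can break the match, here is a minimal standalone sketch that does not use Zend at all. It assumes, purely for illustration, that the UTF-8 bytes end up being interpreted as ISO-8859-1 before being converted to UTF-8; the actual fallback used when the encoding is an empty string may differ and likely depends on the environment. The variable names are hypothetical.

```php
<?php
// "jaźń" written as explicit UTF-8 bytes, so this file's own encoding does not matter.
$word = "ja\xC5\xBA\xC5\x84";

// Assumption for illustration only: the input is treated as ISO-8859-1
// because no encoding was passed. The real fallback may be different.
$assumedEncoding = 'ISO-8859-1';

// Converting UTF-8 bytes as if they were single-byte characters turns
// every non-ASCII byte into a separate two-byte UTF-8 sequence.
$garbled = iconv($assumedEncoding, 'UTF-8', $word);

echo bin2hex($word), "\n";     // 6a61c5bac584
echo bin2hex($garbled), "\n";  // 6a61c385c2bac385c284

// The re-encoded term no longer equals the original word,
// so it cannot match the text that should be highlighted.
var_dump($garbled === $word);  // bool(false)
```

Under that assumption the term grows from 6 to 10 bytes and no longer equals the word in the document, which would be consistent with the missing highlight described above.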
Code:
$query = "jaźń";

// Open the index, or create it if it does not exist yet
try {
    $index = Zend_Search_Lucene::open($this->getOption('indexDirectory'));
} catch (Zend_Search_Lucene_Exception $e) {
    $index = Zend_Search_Lucene::create($this->getOption('indexDirectory'));
}

Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
);

// Index a single document with accented characters in the title
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::text('title', 'żółć gęślą jaźń ąćź', 'utf-8'));
$index->addDocument($doc);
$index->commit();

$query = Zend_Search_Lucene_Search_QueryParser::parse($query);

try {
    $results = $index->find($query);
    foreach ($results as $result) {
        // Expected: "jaźń" highlighted in the title; actual: no highlight
        echo $query->highlightMatches($result->title, 'utf-8') . "\n";
    }
} catch (Exception $e) {}