ZF-11009: Highlighting does not work as expected (accented characters)


The query gets tokenized twice (?). The first time is in class Zend_Search_Lucene_Search_Query_Preprocessing_Term on line 299:

$tokens = Zend_Search_Lucene_Analysis_Analyzer::getDefault()->tokenize($this->_word, $this->_encoding);

At this point everything is okay: $this->_encoding is 'utf-8'. But then there is another ->tokenize() call in Zend_Search_Lucene_Document_Html on line 427 (inside a foreach):

$wordsToHighlightList[] = $analyzer->tokenize($wordString);

but the analyzer is missing encoding (line 424):

$analyzer = Zend_Search_Lucene_Analysis_Analyzer::getDefault();

... and unlike the first tokenize call, this one gives me a corrupted string and, as a result, no highlighting. Inside the analyzer the encoding appears as an empty string. When I change the code on line 427 to:

$wordsToHighlightList[] = $analyzer->tokenize($wordString, 'utf-8');

... everything works fine.
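To illustrate the kind of corruption described above, here is a minimal standalone sketch. It does not use Zend_Search_Lucene and is not the analyzer's actual code: tokenizeBytewise() and tokenizeUtf8() are hypothetical helpers, and the assumption is that a tokenizer with no encoding information ends up treating the UTF-8 string byte by byte. The source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helpers for illustration only; not part of Zend Framework.

/**
 * Splits on ASCII letters only, the way a tokenizer might behave
 * when it does not know the input is UTF-8.
 */
function tokenizeBytewise($text)
{
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return $matches[0];
}

/**
 * Splits on Unicode letters, treating the input as UTF-8 (the /u modifier).
 */
function tokenizeUtf8($text)
{
    preg_match_all('/\p{L}+/u', $text, $matches);
    return $matches[0];
}

var_dump(tokenizeBytewise('jaźń')); // the accented letters are lost: only 'ja' remains
var_dump(tokenizeUtf8('jaźń'));     // the whole word 'jaźń' survives as one token
```

If the highlighter receives a truncated token such as 'ja' instead of 'jaźń', it has nothing that matches the indexed word, which would be consistent with the missing highlight.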


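$query = "jaźń";

// Open the index, or create it if it does not exist yet.
try {
    $index = Zend_Search_Lucene::open($this->getOption('indexDirectory'));
} catch (Zend_Search_Lucene_Exception $e) {
    $index = Zend_Search_Lucene::create($this->getOption('indexDirectory'));
}

// The original snippet left this instance unassigned; presumably it was
// meant to be registered as the default analyzer.
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
);

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::text('title', 'żółć gęślą jaźń ąćź', 'utf-8'));
// Not in the original snippet, but without it there is nothing to find.
$index->addDocument($doc);

$query = Zend_Search_Lucene_Search_QueryParser::parse($query);

try {
    $results = $index->find($query);
    foreach ($results as $result) {
        // Expected: the matched word wrapped in highlight markup.
        echo $query->highlightMatches($result->title, 'utf-8') . "\n";
    }
} catch (Exception $e) {
    // Errors are ignored, as in the original report.
}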

