Issues

ZF-7853: Problem with highlightMatches() function

Description

Specific example of error follows:

I have a .pptx document that contains this text string: "... Data Protection Module Karen Brown, DPO Johnny McSweeney ...". The document appears to be indexed correctly.

Using both quoted and un-quoted strings, if I do a search for "Karen" or "Karen Brown", the results come back fine, and using the highlightMatches() function, the keywords are highlighted. However, if I do a search for "Module Karen Brown" I get the following error:

[Notice] Trying to get property of non-object POST /SilverStripe/search-lucene/

Line 293 in D:\wamp\www\ZendFramework-1.9.1\library\Zend\Search\Lucene\Document\Html.php Source

284 foreach ($matchedTokens as $token) { 285 // Cut text after matched token 286 $node->splitText($token->getEndOffset()); 287 288 // Cut matched node 289 $matchedWordNode = $node->splitText($token->getStartOffset()); 290 291 // Retrieve HTML string representation for highlihted word 292 $fullCallbackparamsList = $params; 293 array_unshift($fullCallbackparamsList, $matchedWordNode->nodeValue); 294 $highlightedWordNodeSetHtml = call_user_func_array($callback, $fullCallbackparamsList); 295 296 // Transform HTML string to a DOM representation and automatically transform retrieved string 297 // into valid XHTML (It's automatically done by loadHTML() method) 298 $highlightedWordNodeSetDomDocument = new DOMDocument('1.0', 'UTF-8'); 299 $success = @$highlightedWordNodeSetDomDocument->

My totally stripped down output code looks a bit like this:

foreach ($results as $result) { $myVar = $query->highlightMatches($result->body); echo $myVar; }

If I alter: $myVar = $query->highlightMatches($result->body); to: $myVar = $result->body; the error no longer occurs, but of course, the keywords are no longer highlighted.

This is just one specific example, but I get similar difficulties with .xlsx and .docx documents, and random combinations of search terms.

Interestingly, I don't get the same problem with documents created from database queries.

(NB. I originally sent this issue to Maarten Balliauw, who suggested I submit it to the Issue Tracker as it affects multiple lucene modules.)

Comments

My first idea was that it's utf-8 processing related problem (DOMText::splitText() method needs binary offset instead of character offset), but it was wrong. I think I have found and fixed problem, so it will be commited soon. BTW could you prepare test document and attach it to the issue?

Hi Alexander

I can't unfortunately give you the actual document as it's confidential to the business of this office. Instead, I've prepared a very cut-down single page version of the same document, which contains the bits that were giving me problems. I hope that's enough to be getting on with.

Cut-down version of test file.

Michael, I've tried document you attached to the issue and haven't got a error.

Could you give a ZF version number you use now? Similar problem was fixed earlier.

Second question is a way you get $query. Does it only:


$query = Zend_Search_Lucene_Search_QueryParser::parse('Module Karen Brown');

???

BTW My test code:


$doc = Zend_Search_Lucene_Document_Pptx::loadPptxFile($pathToTestPptxFile, true);

$index = Zend_Search_Lucene::create('index');
$index->addDocument($doc);

$query = Zend_Search_Lucene_Search_QueryParser::parse('Module Karen Brown');

$results = $index->find($query);
foreach ($results as $result) { $myVar = $query->highlightMatches($result->body); echo $myVar; }

Moreover, this code may be reduced to the following:


$doc = Zend_Search_Lucene_Document_Pptx::loadPptxFile($pathToTestPptxFile, true);
$query = Zend_Search_Lucene_Search_QueryParser::parse('Module Karen Brown');
var_dump($query->highlightMatches($doc->body));

Please try it in your environment.

Hi Alexander

Am using: ZendFramework-1.9.1

My indexing code looks like this:

$doc = Zend_Search_Lucene_Document_Pptx::loadPptxFile($doc_filenameandpath, true); $doc->addField(Zend_Search_Lucene_Field::Keyword('key', $doc_filename)); ... other addFields()... $index->addDocument($doc);

My search code looks like this: (note that $searchString would contain, in this example, 'Module Karen Brown' or 'Karen Brown', etc)

$userQuery = Zend_Search_Lucene_Search_QueryParser::parse($searchString); $query = new Zend_Search_Lucene_Search_Query_Boolean(); $query->addSubquery($userQuery, true); $index = Zend_Search_Lucene::open($pathToIndexes . $indexName . '\');
$results = $index->find($query);

The major difference appears to be that I pass the string into Zend_Search_Lucene_Search_QueryParser and then pass that into Zend_Search_Lucene_Search_Query_Boolean(). You are doing it all in one step.

I'll convert my code to look more like yours, but do you think this is significant when we're talking about highlightMatches()?

Alexander

Nope, it's just the same.

If I condense my original three lines:

$userQuery = Zend_Search_Lucene_Search_QueryParser::parse($searchString);
$query = new Zend_Search_Lucene_Search_Query_Boolean(); 
$query->addSubquery($userQuery, true /* required */); 

to this single line:

$query = Zend_Search_Lucene_Search_QueryParser::parse($searchString);

I get exactly the same error in exactly the same circumstances.

m\

Could it actually be the word "Module" that is causing the problem? Is it some kind of reserved word..?

I mentioned that because:

"Modul Karen Brown" does not cause the error, "Module aren Brown" does cause the error.

Hi Alexander

This issue is no longer an issue :-)

I moved the application I was building to a new server, ready for its official release. On the new server, this problem has entirely disappeared!

What is strange is that I'm using the same code, which has simply been copied across, and I'm even using the same test file! The only difference is in the server infrastructure (one was an XP workstation and the other is a Windows 2003 server) and I'm fairly certain that can't have been the issue.

But anyway, as it appears you can't replicate the issue, and it's no longer a concern for me, I guess I'm happy for the issue to be closed.

Thanks for your help.

m\

Bulk change of all issues last updated before 1st January 2010 as "Won't Fix".

Feel free to re-open and provide a patch if you want to fix this issue.