Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 1.7.7
- Fix Version/s: 1.9.6
- Component/s: Zend_Search_Lucene
- Labels: None
Description
If you interleave "find" and "addDocument" calls (e.g. because you want to update an existing document), and some of the documents contain empty strings, the resulting index may return the wrong documents. Here is an example:
// Start over with a fresh index
passthru("rm -rf test.index");
$index = Zend_Search_Lucene::create("test.index");
// The list of entries we want to index
$entries = array(
'asdf',
'abc',
'def',
'ghj',
'klm',
'nop',
'qrs',
'uvw',
'abc',
'', // If this field is not empty, everything is fine!
'hij',
);
foreach ($entries AS $key => $entry) {
// Find and delete existing documents
foreach ($index->find("pk:$key") as $old) {
// Note that this code is never reached in this example
$index->delete($old->id);
}
// Add new document
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed("pk", $key));
$doc->addField(Zend_Search_Lucene_Field::Text("contents", $entry));
$index->addDocument($doc);
}
foreach ($index->find("asdf") AS $r) {
echo "{$r->contents} ({$r->score})\n";
}
// Expected result: asdf (1)
// Actual result: ghj (1)
This seems to have nothing to do with empty values. I'm having this problem on real-life data even when I filter out empty fields. The only thing that seems to help is setting the following parameters:
$index->setMaxBufferedDocs(PHP_INT_MAX);
$index->setMaxMergeDocs(1);
$index->setMergeFactor(100);
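For anyone trying this workaround in the reproduction script above, here is a minimal sketch of how the three settings might be applied once, right after the index is created and before the find/addDocument loop. The helper name applyReportedWorkaround is hypothetical and not part of Zend Framework; the comments paraphrase how the Zend_Search_Lucene documentation describes each setting, and whether the settings fully avoid the problem is not confirmed here.

```php
<?php
// Hypothetical helper (not part of Zend Framework): applies the three
// settings from this report to an index that was just created or opened.
function applyReportedWorkaround($index)
{
    // Minimum number of buffered documents before they are written out
    // as a new segment; PHP_INT_MAX keeps them in memory as long as possible.
    $index->setMaxBufferedDocs(PHP_INT_MAX);

    // Largest number of documents ever merged by addDocument(); with 1,
    // segments holding more documents are left alone by automatic merges.
    $index->setMaxMergeDocs(1);

    // Controls how often segments are merged by addDocument(); a larger
    // value means automatic merges happen less often.
    $index->setMergeFactor(100);

    return $index;
}

// Usage in the reproduction script (assumes Zend Framework 1 is on the
// include path):
// require_once 'Zend/Search/Lucene.php';
// $index = applyReportedWorkaround(Zend_Search_Lucene::create('test.index'));
```

Keeping all added documents buffered in memory trades memory for fewer segment writes, so this is probably only practical for small or medium batches.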