Issues

ZF-6088: Mixing 'find' and 'addDocument' with empty fields results in wrong indexes

Description

If you mix "find" and "addDocument" (e.g. to update an existing document) and some of the documents have fields that contain empty strings, the resulting index may be wrong. Here is an example:

// Start over with a fresh index
passthru("rm -rf test.index");
$index = Zend_Search_Lucene::create("test.index");

// The list of entries we want to index
$entries = array(
  'asdf',
  'abc',
  'def',
  'ghj',
  'klm',
  'nop',
  'qrs',
  'uvw',
  'abc',
  '', // If this field is not empty, everything is fine!
  'hij',
);

foreach ($entries AS $key => $entry) {
  // Find and delete existing documents (find() returns an array of hits)
  foreach ($index->find("pk:$key") as $old) {
    // Note that this code is never reached in this example
    $index->delete($old->id);
  }

  // Add new document
  $doc = new Zend_Search_Lucene_Document();
  $doc->addField(Zend_Search_Lucene_Field::UnIndexed("pk", $key));
  $doc->addField(Zend_Search_Lucene_Field::Text("contents", $entry));
  $index->addDocument($doc);
}

foreach ($index->find("asdf") AS $r) {
  echo "{$r->contents} ({$r->score})\n";
}

// Expected result: asdf (1)
// Actual result: ghj (1)

Comments

This seems to have nothing to do with empty values. I'm having this problem with real-life data even when I filter out empty fields. The only thing that seems to help is to set the following parameters:

      $index->setMaxBufferedDocs(PHP_INT_MAX);
      $index->setMaxMergeDocs(1);
      $index->setMergeFactor(100);

Sorry, even that doesn't solve the problem.

I found out that this problem occurs when a field is empty or omitted entirely. The only solution seems to be to add dummy data when a field is empty:

    // Remove an existing entry (find() returns an array of hits)
    foreach ($index->find('pk:'.$this->id) as $hit)
    {
      $index->delete($hit->id);
    }

    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('pk', $this->id));
    $fields = array('first_name', 'last_name', 'company', 'address', 'city');
    foreach ($fields AS $field)
    {
      if ("" != ($value = trim($this->$field)))
      {
        $doc->addField(Zend_Search_Lucene_Field::UnStored($field, $value, sfConfig::get('sf_charset')));
      }else {
        // Field is empty, add dummy data
        $doc->addField(Zend_Search_Lucene_Field::UnStored($field, "null", sfConfig::get('sf_charset')));
      }
    }

    $index->addDocument($doc);
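
The empty-field check in the workaround above could also be pulled into a small helper, so that each field is added in only one place. This is just a sketch in plain PHP: the function name valueOrPlaceholder is made up for illustration and is not part of Zend_Search_Lucene, and "null" is simply the dummy value used above.

```php
<?php
// Hypothetical helper (not part of Zend_Search_Lucene): returns the trimmed
// value, or a placeholder string when the value is empty after trimming.
function valueOrPlaceholder($value, $placeholder = 'null')
{
    $trimmed = trim((string) $value);
    return ($trimmed === '') ? $placeholder : $trimmed;
}

// Plain-array demonstration with made-up sample data. In the workaround
// above, the returned string would be passed to
// Zend_Search_Lucene_Field::UnStored() instead of being echoed.
$record = array('first_name' => ' Ann ', 'last_name' => '', 'company' => '   ');
foreach ($record as $field => $value) {
    echo $field, ': ', valueOrPlaceholder($value), "\n";
}
```

With such a helper, the if/else in the loop would reduce to a single addField() call per field. Whether the placeholder is acceptable depends on the data, since a search for "null" would then match every document that had an empty field.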

Fixed.

Thanks for the really helpful issue description!