Indexing - Getting Started with Zend_Search_Lucene
Indexing is performed by adding a new document to an existing or new index:
There are two ways to create document object. The first is to do it manually.
Example #1 Manual Document Construction
- $doc = new Zend_Search_Lucene_Document();
- $doc->addField(Zend_Search_Lucene_Field::Text('url', $docUrl));
- $doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));
- $doc->addField(Zend_Search_Lucene_Field::unStored('contents', $docBody));
- $doc->addField(Zend_Search_Lucene_Field::binary('avatar', $avatarData));
The second method is to load it from HTML or Microsoft Office 2007 files:
Example #2 Document loading
- $doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
- $doc = Zend_Search_Lucene_Document_Docx::loadDocxFile($path);
- $doc = Zend_Search_Lucene_Document_Pptx::loadPptFile($path);
- $doc = Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($path);
If a document is loaded from one of the supported formats, it still can be extended manually with new user defined fields.
You should define indexing policy within your application architectural design.
You may need an on-demand indexing configuration (something like OLTP system). In such systems, you usually add one document per user request. As such, the MaxBufferedDocs option will not affect the system. On the other hand, MaxMergeDocs is really helpful as it allows you to limit maximum script execution time. MergeFactor should be set to a value that keeps balance between the average indexing time (it's also affected by average auto-optimization time) and search performance (index optimization level is dependent on the number of segments).
If you will be primarily performing batch index updates, your configuration should use a MaxBufferedDocs option set to the maximum value supported by the available amount of memory. MaxMergeDocs and MergeFactor have to be set to values reducing auto-optimization involvement as much as possible . Full index optimization should be applied after indexing.
Example #3 Index optimization
In some configurations, it's more effective to serialize index updates by organizing update requests into a queue and processing several update requests in a single script execution. This reduces index opening overhead, and allows utilizing index document buffering.