Programmer's Reference Guide
Character set.
Support for UTF-8 and single-byte character sets.
Zend_Search_Lucene works with the UTF-8 character set internally. Index files store Unicode data in Java's "modified UTF-8" encoding, which the Zend_Search_Lucene core fully supports, with one exception. [1]
The actual encoding of the input data may be specified through the Zend_Search_Lucene API. The data is then converted to UTF-8 automatically.
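To illustrate what converting input data to UTF-8 involves, here is a minimal sketch using PHP's iconv() function. It does not call any Zend_Search_Lucene classes and is not the library's internal code; the helper name toUtf8() and the sample string are assumptions made for this example.

```php
<?php
/**
 * Hypothetical helper (not part of Zend_Search_Lucene): converts text
 * from a known source encoding to UTF-8 using iconv().
 */
function toUtf8(string $text, string $sourceEncoding): string
{
    $converted = iconv($sourceEncoding, 'UTF-8', $text);
    if ($converted === false) {
        throw new InvalidArgumentException(
            "Cannot convert from $sourceEncoding to UTF-8"
        );
    }
    return $converted;
}

// "Zürich" encoded as ISO-8859-1: the letter "ü" is the single byte 0xFC.
$latin1 = "Z\xFCrich";
$utf8 = toUtf8($latin1, 'ISO-8859-1');

// In UTF-8 the same letter takes two bytes (0xC3 0xBC).
echo bin2hex($utf8), "\n"; // prints 5ac3bc72696368
```

The point of the sketch is only that the source encoding has to be known: the same bytes mean different characters in different single-byte character sets.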
Default text analyzer.
However, the default text analyzer (which is also used by the query parser) uses ctype_alpha() to tokenize text and queries.
ctype_alpha() is not UTF-8 compatible, so the analyzer converts text to 'ASCII//TRANSLIT' encoding before indexing. The same processing is applied during query parsing, so the conversion is transparent. [2]
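The two steps described above can be sketched in plain PHP. This is an illustration, not the analyzer's actual source: the function name tokenizeAsciiLetters() is hypothetical, and the result of 'ASCII//TRANSLIT' for a given non-ASCII character depends on the platform's iconv implementation and locale (some platforms reject the //TRANSLIT suffix altogether), so the sample uses ASCII input only.

```php
<?php
/**
 * Hypothetical sketch: transliterate UTF-8 text to ASCII, then split it
 * into runs of characters accepted by ctype_alpha().
 */
function tokenizeAsciiLetters(string $utf8Text): array
{
    // Step 1: transliterate to ASCII. The mapping of non-ASCII characters
    // is implementation-dependent.
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $utf8Text);
    if ($ascii === false) {
        throw new RuntimeException('iconv() could not transliterate the input');
    }

    // Step 2: collect consecutive alphabetic bytes into tokens.
    $tokens = [];
    $current = '';
    $length = strlen($ascii);
    for ($i = 0; $i < $length; $i++) {
        if (ctype_alpha($ascii[$i])) {
            $current .= $ascii[$i];
        } elseif ($current !== '') {
            $tokens[] = $current;
            $current = '';
        }
    }
    if ($current !== '') {
        $tokens[] = $current;
    }
    return $tokens;
}

// Digits and punctuation act as separators, so "2nd" yields the token "nd".
print_r(tokenizeAsciiLetters('full-text search, 2nd edition'));
```

Because the same function would be applied to both documents and queries, a term and its transliterated form end up matching each other, which is why the conversion goes unnoticed.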
UTF-8 compatible text analyzer.
Zend_Search_Lucene also includes a UTF-8 analyzer with limited functionality. It can be enabled with the following code:
<?php
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
It tokenizes data for indexing in UTF-8 mode and handles any UTF-8 compatible character.
It has two limitations:
- it treats all non-ASCII characters as letters (which is not always true);
- it is case-sensitive.
Because of these limitations it is not set as the default analyzer, but it may still be useful in some cases.
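Both limitations can be reproduced with a short stand-alone sketch. The regular expression below is an assumption chosen to mimic the described behaviour (ASCII letters plus every non-ASCII character count as letters); it is not taken from the analyzer's source, and tokenizeUtf8Sketch() is a hypothetical name. The file must be saved as UTF-8.

```php
<?php
/**
 * Hypothetical sketch: split UTF-8 text into tokens made of ASCII letters
 * and any non-ASCII code point, without changing case.
 */
function tokenizeUtf8Sketch(string $utf8Text): array
{
    // [^\x00-\x7F] matches every code point outside the ASCII range.
    preg_match_all('/(?:[A-Za-z]|[^\x00-\x7F])+/u', $utf8Text, $matches);
    return $matches[0];
}

$tokens = tokenizeUtf8Sketch('Zürich 10€ Bar');
// $tokens is ['Zürich', '€', 'Bar']

// Limitation 1: the currency sign becomes a token although it is not a letter.
var_dump(in_array('€', $tokens, true));   // bool(true)

// Limitation 2: case is preserved, so "bar" does not match "Bar".
var_dump(in_array('bar', $tokens, true)); // bool(false)
```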
Case insensitivity may be emulated with strtolower() conversion:
<?php
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');
// ...
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
// ...
$doc = new Zend_Search_Lucene_Document();
// Contents field for searching (indexed, unstored), lowercased before indexing
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', strtolower($contents)));
// Title field for searching (indexed, unstored), lowercased before indexing
$doc->addField(Zend_Search_Lucene_Field::UnStored('title', strtolower($title)));
// Title field for retrieval (unindexed, stored), kept in its original case
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('_title', $title));
The same conversion must be applied to the query string:
<?php
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');
// ...
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
// ...
// Lowercase the query the same way the indexed text was lowercased
$hits = $index->find(strtolower($query));
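Since the index and the query have to be lowercased in exactly the same way, it can help to route both through a single helper. The sketch below is one possible variant and not the method shown in this guide: it swaps strtolower() for mb_strtolower(), assumes that the mbstring extension is available and that the text is already UTF-8, and uses the hypothetical helper name normalizeForSearch().

```php
<?php
/**
 * Hypothetical helper: lowercases UTF-8 text so that indexed values and
 * query strings are normalized identically.
 * Assumes the mbstring extension is installed.
 */
function normalizeForSearch(string $utf8Text): string
{
    return mb_strtolower($utf8Text, 'UTF-8');
}

// The same function is applied at indexing time and at query time.
$indexed = normalizeForSearch('Zürich Hauptbahnhof');
$query   = normalizeForSearch('ZÜRICH');

// After normalization the query is a prefix of the indexed value.
var_dump(strpos($indexed, $query) === 0); // bool(true)
```

Keeping the conversion in one place reduces the risk that documents and queries drift apart, for example when one side is later changed to a different locale or function.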
