ZF-85: Query Parser not handling field names with underscores

Description

Parsing a query such as:

$query = Zend_Search_Lucene_Search_QueryParser::parse('title:bob');

correctly creates a private term with the field set to title and the value set to bob, like so:

[_term:private] => Zend_Search_Lucene_Index_Term Object
    (
        [field] => title
        [text] => bob
    )

However, if the field name contains an underscore, the tokenizer splits it at the underscore. For example:

$query = Zend_Search_Lucene_Search_QueryParser::parse('title_en:bob');

gives:

[_terms:private] => Array
    (
        [0] => Zend_Search_Lucene_Index_Term Object
            (
                [field] => contents
                [text] => title
            )

        [1] => Zend_Search_Lucene_Index_Term Object
            (
                [field] => en
                [text] => bob
            )

    )

This may be expected behaviour; maybe underscores should be banned from index field names. I was using them because we have a bilingual collection, so my fields are title_en, title_gd, contents_en, contents_gd, etc. (which has made it fun trying to work around the inability to set a default field!)

But it would strike me that the use of ctype_alnum to decide on token types in QueryTokenizer may be a tad stricter than necessary.
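To illustrate the point, here is a minimal sketch of a character-by-character tokenizer that uses only ctype_alnum to decide what belongs to a word token. The function name sketchTokenize is hypothetical and this is not the actual QueryTokenizer code; it only mimics the splitting behaviour described above, under the assumption that every non-alphanumeric character ends the current word.

```php
<?php
// Hypothetical helper, NOT part of Zend_Search_Lucene: splits a query
// string into word tokens and single-character syntax tokens, treating
// only alphanumeric characters as part of a word.
function sketchTokenize(string $input): array
{
    $tokens  = [];
    $current = '';
    $length  = strlen($input);

    for ($i = 0; $i < $length; $i++) {
        $char = $input[$i];

        if (ctype_alnum($char)) {
            $current .= $char;       // still inside a word
            continue;
        }

        if ($current !== '') {       // previous word token is finished
            $tokens[] = $current;
            $current  = '';
        }

        if (!ctype_space($char)) {   // keep non-space characters such as ':' or '_'
            $tokens[] = $char;
        }
    }

    if ($current !== '') {           // flush the trailing word, if any
        $tokens[] = $current;
    }

    return $tokens;
}

print_r(sketchTokenize('title_en:bob'));
```

In this sketch the underscore ends the word "title", so "title" and "en" come out as separate word tokens, and only "en" ends up next to the colon. That matches the split seen in the parsed query above.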

Esoteric one this I'm sure though. Probably the only person in the world who's used underscores in their fields :)

Comments

Anything happening on this issue? If so, set a fix version with the expected time frame it will come in; otherwise, assign it to Alex. Thanks. I'm setting it for 0.3.0 in the meantime.

I found the problem. Tomorrow I will submit the fixed version of the tokenizer here.

Here is the diff:

Index: C:/Apps/www/lib/3rdparty/zend_framework/library/Zend/Search/Lucene/Search/QueryTokenizer.php

--- C:/Apps/www/lib/3rdparty/zend_framework/library/Zend/Search/Lucene/Search/QueryTokenizer.php (revision 5742)
+++ C:/Apps/www/lib/3rdparty/zend_framework/library/Zend/Search/Lucene/Search/QueryTokenizer.php (revision 5743)
@@ -64,7 +64,7 @@

     $currentToken = '';
     for ($count = 0; $count < strlen($inputString); $count++) {
-        if (ctype_alnum( $inputString{$count} )) {
+        if (ctype_alnum( $inputString{$count} ) || $inputString{$count} == "_") {
             $currentToken .= $inputString{$count};
         } else {
             // Previous token is finished
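The patched condition from the diff can be tried in isolation. The sketch below is only an illustration: isWordChar and firstWordToken are hypothetical names that do not exist in the framework, and the code uses the square-bracket string offset syntax rather than the curly-brace form shown in the diff, since newer PHP versions no longer accept the latter.

```php
<?php
// Hypothetical predicate mirroring the patched condition in the diff:
// a character belongs to a word token if it is alphanumeric or "_".
function isWordChar(string $char): bool
{
    return ctype_alnum($char) || $char === '_';
}

// Hypothetical helper: collect the leading run of word characters,
// i.e. what a tokenizer using isWordChar() would emit as its first word.
function firstWordToken(string $input): string
{
    $token  = '';
    $length = strlen($input);

    for ($i = 0; $i < $length && isWordChar($input[$i]); $i++) {
        $token .= $input[$i];
    }

    return $token;
}

echo firstWordToken('title_en:bob'), "\n"; // prints "title_en"
```

With the underscore accepted as a word character, the whole of title_en is read as one token before the colon, which is what a field name with an underscore needs.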

The issue is already fixed. Please take the current SVN version (http://framework.zend.com/wiki/display/…). Sorry that I missed your first comment, so you ended up doing work that was already done.

And welcome to the development team! :)

PS: Have you already signed the CLA? (http://framework.zend.com/faq/contributing#q2)