Issues

ZF-2598: regex for wildcard query (single-placeholder) is wrong

Description

If I search for "lebe?" it correctly matches "leber", "leben", etc. But if I search for "le?en" it doesn't match anything. I dumped the pattern in use and there seems to be a bug: the pattern generated for "lebe?" is "/^lebe.$/u", which is correct, but for "le?en" the generated pattern is "/^.$/u", which is plainly wrong. I didn't track it further, but my guess is that the problem is in the QueryParser. It seems to work only if the placeholder is at the start or end of the term and fails if it is in between.

Comments

I found the problem. This happens because the wildcard term is tokenized, but my tokenizer discards tokens smaller than 2 characters. Therefore "le" before the question mark and "en" after it are discarded and only the question mark remains. But since the term is only tokenized to ensure that it's not a phrase, I'd say it's an error that the tokens are looped through.
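The effect described above can be reproduced outside the library with a minimal sketch. Everything here is a stand-in: `toyTokenize()` and `buildWildcardPattern()` are hypothetical helpers, not Zend_Search_Lucene code, and the minimum token length of 3 is an assumption chosen so that "le" and "en" are dropped as in the report.

```php
<?php
// Toy stand-in for an analyzer (ASCII only): lowercase, split on
// non-alphanumerics, then drop tokens shorter than $minLength.
function toyTokenize(string $text, int $minLength): array
{
    $tokens = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_filter($tokens, function ($t) use ($minLength) {
        return strlen($t) >= $minLength;
    }));
}

// Mimics the reported behaviour: each piece between '?' placeholders is
// tokenized, and only the surviving tokens are copied into the pattern.
function buildWildcardPattern(string $term, int $minLength): string
{
    $parts = [];
    foreach (explode('?', $term) as $subPattern) {
        $tokens = toyTokenize($subPattern, $minLength);
        if (count($tokens) > 1) {
            throw new InvalidArgumentException(
                'Wildcard search is supported only for non-multiple word terms'
            );
        }
        $parts[] = preg_quote(implode('', $tokens), '/');
    }
    // Each '?' becomes a single-character wildcard.
    return '/^' . implode('.', $parts) . '$/u';
}

echo buildWildcardPattern('lebe?', 3), "\n"; // /^lebe.$/u
echo buildWildcardPattern('le?en', 3), "\n"; // /^.$/u   ("le" and "en" were filtered out)
echo buildWildcardPattern('le?en', 1), "\n"; // /^le.en$/u (no short-word filtering)
```

With the filter threshold lowered to 1, the same term yields the expected pattern, which suggests the filter rather than the placeholder position is what breaks the query.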

Instead of

    $tokens = Zend_Search_Lucene_Analysis_Analyzer::getDefault()->tokenize($subPatternL2, $encoding);
    if (count($tokens) > 1) {
        throw new Zend_Search_Lucene_Search_QueryParserException('Wildcard search is supported only for non-multiple word terms');
    }
    #var_dump($tokens);
    foreach ($tokens as $token) {
        $pattern .= $token->getTermText();
    }

do this

    $tokens = Zend_Search_Lucene_Analysis_Analyzer::getDefault()->tokenize($subPatternL2, $encoding);
    if (count($tokens) > 1) {
        throw new Zend_Search_Lucene_Search_QueryParserException('Wildcard search is supported only for non-multiple word terms');
    }
    $pattern .= $subPatternL2;

and it should work as expected.

I completely forgot to mention: the posted code is in Zend_Search_Lucene_Search_QueryEntry_Term, around line 145.

By the way, my fix won't work correctly in all cases, since any special characters (punctuation, etc.) at the start and end of the term won't be discarded as they would be if the tokenized term were used. I think a better solution would be to allow the analyzer filters to be disabled:

    $analyzer = Zend_Search_Lucene_Analysis_Analyzer::getDefault();
    $analyzer->setFiltersEnabled(false);
    $tokens = $analyzer->tokenize($subPatternL2, $encoding);
    $analyzer->setFiltersEnabled(true);
    if (count($tokens) > 1) {
        throw new Zend_Search_Lucene_Search_QueryParserException('Wildcard search is supported only for non-multiple word terms');
    }
    foreach ($tokens as $token) {
        $pattern .= $token->getTermText();
    }

I think this should work perfectly with the default analyzers, although it may still be a problem with custom-written analyzers?
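The trade-off between the original code and the two proposals can be illustrated with a small, self-contained sketch. It does not use Zend_Search_Lucene: `sketchTokenize()` and `sketchPattern()` are hypothetical helpers, the short-word threshold of 3 is an assumption, and the token-count check from the real code is left out for brevity.

```php
<?php
// Toy tokenizer (ASCII only): lowercases and splits on non-alphanumerics.
// The short-word filter is applied only when $filtersEnabled is true.
function sketchTokenize(string $text, bool $filtersEnabled, int $minLength = 3): array
{
    $tokens = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if ($filtersEnabled) {
        $tokens = array_filter($tokens, function ($t) use ($minLength) {
            return strlen($t) >= $minLength;
        });
    }
    return array_values($tokens);
}

// $mode selects the strategy:
//   'filtered'   - tokens after filtering (the reported behaviour)
//   'raw'        - the untokenized sub-pattern (first proposed fix)
//   'unfiltered' - tokens with filters switched off (second proposal)
function sketchPattern(string $term, string $mode): string
{
    $parts = [];
    foreach (explode('?', $term) as $sub) {
        if ($mode === 'raw') {
            $parts[] = preg_quote($sub, '/');
        } else {
            $tokens = sketchTokenize($sub, $mode === 'filtered');
            $parts[] = preg_quote(implode('', $tokens), '/');
        }
    }
    return '/^' . implode('.', $parts) . '$/u';
}

$term = 'le?en,'; // trailing punctuation, as a user might type it
foreach (['filtered', 'raw', 'unfiltered'] as $mode) {
    $pattern = sketchPattern($term, $mode);
    $hit = preg_match($pattern, 'leben') ? 'yes' : 'no';
    echo $mode, ': ', $pattern, ' matches "leben": ', $hit, "\n";
}
// filtered:   /^.$/u       -> no
// raw:        /^le.en,$/u  -> no (the comma is kept in the pattern)
// unfiltered: /^le.en$/u   -> yes
```

In this toy setup only the unfiltered tokenization both keeps the short fragments and strips the punctuation. Whether that carries over to a given real analyzer depends on how that analyzer separates tokenizing from filtering.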

Stefan,

Thank you very much for your detailed research of the problem!

It has one additional aspect. Filters are also used for normalizing terms (e.g. converting to lowercase, stemming and so on). Such normalization has to be performed during both indexing and query parsing to get correct term matching.

So we have two types of filters: 1) normalization filters and 2) "filtering" filters (e.g. stop-word and short-word filters). That needs a more precise API.

I've lowered the issue priority to postpone it until after the 1.5 release.

PS: Actually, normalization filters may transform a term completely (e.g. stemming filters). I am not sure such filters can be compatible with wildcard searching.

This doesn't appear to have been fixed in 1.5.0. Please update if this is not correct.

As far as I can see, it is still an issue. But it might not need to be fixed.

A possible workaround is for the user to use different analyzers for searching and for indexing.

One example where I did this is with synonym analyzers, which provide several alternative tokens for a single word in the original text. If all those alternatives end up in the index, it's enough to search for a single one of them. (And unless ZF-7738 gets integrated, multiple tokens for a single term raise exceptions in fuzzy and wildcard searches anyway.)

The same goes for stop-word filtering: it makes sense during indexing, but it's counterproductive during search.

A stemming filter, however, may make sense both during indexing and during search (though it may affect wildcard searches in odd ways).

Since it's no problem to switch analyzers between the different tasks, I don't think this bug needs to be fixed. It may make more sense to document these effects in the official documentation.

Rather than fixing this, which seems a bit convoluted, the approach suggested in the last comment is sound. I have therefore re-categorised this as a Critical Docs Improvement.

Would anyone like to write such a documentation improvement before 1.12 is frozen?