ZF-6899: Can not search chinese string by chinese UTF-8 word sequence

Issue Type: Improvement
Created: 2009-06-03T20:52:48.000+0000
Last Updated: 2012-11-21T08:16:56.000+0000
Status: Reopened
Fix version(s):
Reporter: Kevin (mincer)
Assignee: None
Tags: - Zend_Search_Lucene

Related issues: - ZF-7738



StandardAnalyzer in Java Lucene can search a Chinese string by a Chinese UTF-8 word sequence. For example, there are two documents: one is "文件编辑器", the other is "编码格式". When we search for the string "编辑", the analyzer we use splits it into "编" and "辑". In Zend we get both documents in the result, but in Java Lucene we get only the correct document, the first one.

Could Zend search take the relative position of the two terms into account? Thanks!
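To make the report concrete, here is a minimal sketch in Python of the two matching strategies on a toy positional index. It is not Zend_Search_Lucene code: the helper names (`tokenize`, `build_index`, `multi_term_search`, `phrase_search`) are hypothetical, the tokenizer assumes one token per character as described above, and the multi-term search assumes every term is optional.

```python
from collections import defaultdict

def tokenize(text):
    # Assumption: one token per character, as the report describes.
    return list(text)

def build_index(docs):
    # term -> {doc_id: [positions of the term in that document]}
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def multi_term_search(index, query):
    # Any document containing at least one query term; positions are ignored.
    hits = set()
    for term in tokenize(query):
        hits |= set(index.get(term, {}))
    return sorted(hits)

def phrase_search(index, query):
    # Only documents where the query terms occur at consecutive positions.
    terms = tokenize(query)
    if not terms:
        return []
    hits = []
    for doc_id, positions in index.get(terms[0], {}).items():
        for start in positions:
            if all(start + i in index.get(t, {}).get(doc_id, [])
                   for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)

docs = {1: "文件编辑器", 2: "编码格式"}
index = build_index(docs)
print(multi_term_search(index, "编辑"))  # [1, 2]
print(phrase_search(index, "编辑"))      # [1]
```

Document 2 matches the multi-term search only because it contains "编"; the phrase search rejects it because "辑" never follows that character there.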


Posted by old of Satoru Yoshida on 2009-06-04T02:43:31.000+0000

Chinese is one of the languages that do not put a space between words.

Unfortunately, Zend_Search_Lucene does not currently seem to support Chinese, nor incorporating or agglutinative languages.

I think it would be 'postpone'.

Posted by Kevin (mincer) on 2009-06-04T18:49:04.000+0000

Hi Satoru,

Thanks for your reply. I have investigated this issue by comparing with Java Lucene, and found that the root cause is that Zend only uses Zend_Search_Lucene_Search_Query_MultiTerm for each clause separated by Zend_Search_Lucene_Search_QueryParser (please see Zend\Search\Lucene\Search\QueryEntry\Term.php:line 182).

Is there any way to use Zend_Search_Lucene_Search_Query_Phrase for each sub clause?

Posted by Corvus Corax (corvuscorax) on 2009-09-01T02:35:43.000+0000

I just wrote a patch that, among other things, does exactly that, though in Zend\Search\Lucene\Search\Query/Preprocessing/Term.php.

Previously, if a Term search phrase was tokenized into several tokens by the analyzer, Zend\Search\Lucene\Search\Query/Preprocessing/Term.php created a Zend_Search_Lucene_Search_Query_MultiTerm query in its rewrite() function. I changed that into a Zend_Search_Lucene_Search_Query_Phrase.

Theoretically that should fix the Chinese phrase problem.

I'll run a test on the two supplied strings...

Posted by Corvus Corax (corvuscorax) on 2009-09-01T03:14:00.000+0000

I am obviously missing the analyzer you mentioned that tokenizes each Chinese word into a separate token. Common_Utf8Num just tokenizes all of 文件编辑器 into one token, which can be matched with *编辑* (however, with terrible performance due to the leading *).

I still think the patch should in theory fix that, but I need someone with access to that Analyzer to confirm.
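As a rough illustration of the tokenization difference discussed here, the following Python sketch contrasts a whole-run tokenizer with a per-character one. It is only a model: the regular expression stands in for the behaviour described for Common_Utf8Num and is not its actual implementation, and both function names are hypothetical.

```python
import re

def whole_run_tokens(text):
    # Assumption: each unbroken run of letters/digits becomes one token.
    return re.findall(r"\w+", text)

def per_char_tokens(text):
    # Assumption: each letter/digit becomes its own token.
    return [ch for ch in text if ch.isalnum()]

# Term dictionary built from both documents with the whole-run tokenizer.
terms = set(whole_run_tokens("文件编辑器 编码格式"))

# An exact term lookup for the query fails, because no token equals "编辑".
exact_hit = "编辑" in terms

# A leading-wildcard match has to scan every term in the dictionary.
wildcard_hits = sorted(t for t in terms if "编辑" in t)

print(exact_hit)                      # False
print(wildcard_hits)                  # ['文件编辑器']
print(per_char_tokens("文件编辑器"))  # ['文', '件', '编', '辑', '器']
```

With whole-run tokens the query can only be found by scanning the full term dictionary, which is where the cost of the leading wildcard comes from. A phrase query only becomes meaningful with per-character tokens.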

Posted by Rob Allen (rob) on 2012-11-20T20:53:33.000+0000

Bulk change of all issues last updated before 1st January 2010 as "Won't Fix".

Feel free to re-open and provide a patch if you want to fix this issue.

Posted by Corvus Corax (corvuscorax) on 2012-11-21T08:16:56.000+0000

The bug was closed by default in the "Bulk change". However, a patch is already attached (in this case to ZF-7738); what has been missing is a maintainer.
