Issues

ZF-3683: Performance improvement: reuse token object in lowercase filter

Issue Type: Bug Created: 2008-07-18T09:39:12.000+0000 Last Updated: 2012-05-30T13:15:19.000+0000 Status: Resolved Fix version(s): - 1.12.0 (27/Aug/12)

Reporter: zoe slattery (zoe) Assignee: Adam Lundrigan (adamlundrigan) Tags: - Zend_Search_Lucene

  • FixForZF1.12
  • state:need-feedback
  • zf-caretaker-adamlundrigan
  • zf-crteam-padraic
  • zf-crteam-priority
  • zf-crteam-review

Related issues: Attachments: - ZF-3683.patch

Description

There is a small performance improvement that could be made to the part of the code that creates indexes. In class Zend_Search_Lucene_Analysis_Analyzer_Common_Text in the nextToken() method the following line of code is executed:

$token = $this->normalize(new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));

This calls the normalize() method of Zend_Search_Lucene_Analysis_TokenFilter_LowerCase. A Token object is created as part of the call to normalize(), inside normalize() a second Token object is created which is an exact copy of the first apart from the fact that the text is converted to lower case.

This means that two Token objects are created for every token - this has some impact on performance (about 7% on the examples that I've looked at). One very simple fix would be to change normalize() so that it doesn't create a second object but just updates the text in the object that is passed to it. This would also require a change to Token.php to allow the text field to be set. I expect that there are more architecturally pleasing ways to fix.

Comments

Posted by Benjamin Eberlei (beberlei) on 2008-11-08T00:31:46.000+0000

Re-evaluated to major since Lucene already is very memory hungry 7% is quite a bit, and this can probably be fixed very easily (as reporter suggested).

Posted by Adam Lundrigan (adamlundrigan) on 2011-10-24T17:40:40.000+0000

Here is a patch implementing the issue submitter's suggestion:

<pre class="highlight">
Index: library/Zend/Search/Lucene/Analysis/TokenFilter/LowerCase.php
===================================================================
--- library/Zend/Search/Lucene/Analysis/TokenFilter/LowerCase.php   (revision 24527)
+++ library/Zend/Search/Lucene/Analysis/TokenFilter/LowerCase.php   (working copy)
@@ -45,14 +45,8 @@
      */
     public function normalize(Zend_Search_Lucene_Analysis_Token $srcToken)
     {
-        $newToken = new Zend_Search_Lucene_Analysis_Token(
-                                     strtolower( $srcToken->getTermText() ),
-                                     $srcToken->getStartOffset(),
-                                     $srcToken->getEndOffset());
-
-        $newToken->setPositionIncrement($srcToken->getPositionIncrement());
-
-        return $newToken;
+        $srcToken->setTermText(strtolower($srcToken->getTermText()));
+        return $srcToken;
     }
 }
 
Index: library/Zend/Search/Lucene/Analysis/TokenFilter/LowerCaseUtf8.php
===================================================================
--- library/Zend/Search/Lucene/Analysis/TokenFilter/LowerCaseUtf8.php   (revision 24527)
+++ library/Zend/Search/Lucene/Analysis/TokenFilter/LowerCaseUtf8.php   (working copy)
@@ -57,14 +57,8 @@
      */
     public function normalize(Zend_Search_Lucene_Analysis_Token $srcToken)
     {
-        $newToken = new Zend_Search_Lucene_Analysis_Token(
-                                     mb_strtolower($srcToken->getTermText(), 'UTF-8'),
-                                     $srcToken->getStartOffset(),
-                                     $srcToken->getEndOffset());
-
-        $newToken->setPositionIncrement($srcToken->getPositionIncrement());
-
-        return $newToken;
+        $srcToken->setTermText(mb_strtolower($srcToken->getTermText(), 'UTF-8'));
+        return $srcToken;
     }
 }
 
Index: library/Zend/Search/Lucene/Analysis/Token.php
===================================================================
--- library/Zend/Search/Lucene/Analysis/Token.php   (revision 24527)
+++ library/Zend/Search/Lucene/Analysis/Token.php   (working copy)
@@ -123,6 +123,18 @@
     {
         return $this->_termText;
     }
+    
+    /**
+     * Sets the Token's term text.
+     * 
+     * @param string $text
+     * @return this
+     */
+    public function setTermText($text)
+    {
+        $this->_termText = $text;
+        return $this;
+    }
 
     /**
      * Returns this Token's starting offset, the position of the first character

The Zend_Search_Lucene test suite still passes after the above modifications are made, however the area of Zend_Search_Lucene in question is not unit tested.

Are there any backwards-compatibility issues with changing this behavior now? Previously, an entirely new token would come out of normalize(), but with the above modifications the provided token is modified in-place and returned. Are there any situations where this would be inappropriate?

Posted by Adam Lundrigan (adamlundrigan) on 2012-05-30T13:15:07.000+0000

Fixed in trunk r24832

Have you found an issue?

See the Overview section for more details.

Copyright

© 2006-2016 by Zend, a Rogue Wave Company. Made with by awesome contributors.

This website is built using zend-expressive and it runs on PHP 7.

Contacts