ZF-3683: Performance improvement: reuse token object in lowercase filter

Description

There is a small performance improvement that could be made to the part of the code that creates indexes. In class Zend_Search_Lucene_Analysis_Analyzer_Common_Text in the nextToken() method the following line of code is executed:

$token = $this->normalize(new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));

This calls the normalize() method of Zend_Search_Lucene_Analysis_TokenFilter_LowerCase. A Token object is created as part of the call to normalize(), inside normalize() a second Token object is created which is an exact copy of the first apart from the fact that the text is converted to lower case.

This means that two Token objects are created for every token - this has some impact on performance (about 7% on the examples that I've looked at). One very simple fix would be to change normalize() so that it doesn't create a second object but just updates the text in the object that is passed to it. This would also require a change to Token.php to allow the text field to be set. I expect that there are more architecturally pleasing ways to fix.

Comments

Re-evaluated to major since Lucene already is very memory hungry 7% is quite a bit, and this can probably be fixed very easily (as reporter suggested).

Here is a patch implementing the issue submitter's suggestion:


Index: library/Zend/Search/Lucene/Analysis/TokenFilter/LowerCase.php
===================================================================
--- library/Zend/Search/Lucene/Analysis/TokenFilter/LowerCase.php   (revision 24527)
+++ library/Zend/Search/Lucene/Analysis/TokenFilter/LowerCase.php   (working copy)
@@ -45,14 +45,8 @@
      */
     public function normalize(Zend_Search_Lucene_Analysis_Token $srcToken)
     {
-        $newToken = new Zend_Search_Lucene_Analysis_Token(
-                                     strtolower( $srcToken->getTermText() ),
-                                     $srcToken->getStartOffset(),
-                                     $srcToken->getEndOffset());
-
-        $newToken->setPositionIncrement($srcToken->getPositionIncrement());
-
-        return $newToken;
+        $srcToken->setTermText(strtolower($srcToken->getTermText()));
+        return $srcToken;
     }
 }
 
Index: library/Zend/Search/Lucene/Analysis/TokenFilter/LowerCaseUtf8.php
===================================================================
--- library/Zend/Search/Lucene/Analysis/TokenFilter/LowerCaseUtf8.php   (revision 24527)
+++ library/Zend/Search/Lucene/Analysis/TokenFilter/LowerCaseUtf8.php   (working copy)
@@ -57,14 +57,8 @@
      */
     public function normalize(Zend_Search_Lucene_Analysis_Token $srcToken)
     {
-        $newToken = new Zend_Search_Lucene_Analysis_Token(
-                                     mb_strtolower($srcToken->getTermText(), 'UTF-8'),
-                                     $srcToken->getStartOffset(),
-                                     $srcToken->getEndOffset());
-
-        $newToken->setPositionIncrement($srcToken->getPositionIncrement());
-
-        return $newToken;
+        $srcToken->setTermText(mb_strtolower($srcToken->getTermText(), 'UTF-8'));
+        return $srcToken;
     }
 }
 
Index: library/Zend/Search/Lucene/Analysis/Token.php
===================================================================
--- library/Zend/Search/Lucene/Analysis/Token.php   (revision 24527)
+++ library/Zend/Search/Lucene/Analysis/Token.php   (working copy)
@@ -123,6 +123,18 @@
     {
         return $this->_termText;
     }
+    
+    /**
+     * Sets the Token's term text.
+     * 
+     * @param string $text
+     * @return this
+     */
+    public function setTermText($text)
+    {
+        $this->_termText = $text;
+        return $this;
+    }
 
     /**
      * Returns this Token's starting offset, the position of the first character

The Zend_Search_Lucene test suite still passes after the above modifications are made, however the area of Zend_Search_Lucene in question is not unit tested.

Are there any backwards-compatibility issues with changing this behavior now? Previously, an entirely new token would come out of {{normalize()}}, but with the above modifications the provided token is modified in-place and returned. Are there any situations where this would be inappropriate?

Fixed in trunk r24832