
Zend Framework: Zend_Filter_PorterStemmer Component Proposal

Proposed Component Name Zend_Filter_PorterStemmer
Developer Notes http://framework.zend.com/wiki/display/ZFDEV/Zend_Filter_PorterStemmer
Proposers Marc Bennewitz
Zend Liaison TBD
Revision 1.0 - 27 March 2010: Initial Draft. (wiki revision: 3)

1. Overview

Zend_Filter_PorterStemmer is a set of localized stemmers based on the Porter stemming algorithm.

2. References

3. Component Requirements, Constraints, and Acceptance Criteria

  • This component will reduce inflected (or sometimes derived) words to their stem.
  • This component will not handle different encodings (all input must be UTF-8).
  • This component will not split text into words.
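
Because the component neither converts encodings nor tokenizes text, that work falls to the calling code. The following is a minimal sketch of one way to prepare input, assuming the mbstring extension is available. The function name sketch_extract_words and the choice to treat every run of Unicode letters as one word are illustrative assumptions, not part of the proposal.

```php
<?php
/**
 * Hypothetical helper (not part of the proposal): converts text to UTF-8
 * if needed and splits it into words that can be stemmed one at a time.
 */
function sketch_extract_words($text, $sourceEncoding = 'UTF-8')
{
    // Convert to UTF-8 first, since the stemmer would only accept UTF-8.
    if ($sourceEncoding !== 'UTF-8') {
        $text = mb_convert_encoding($text, 'UTF-8', $sourceEncoding);
    }
    if (!mb_check_encoding($text, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // Split on every run of characters that are not Unicode letters
    // and drop the empty strings produced at the edges.
    return preg_split('/[^\p{L}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$words = sketch_extract_words('The ponies were running.');
// Each element of $words can now be passed to the stemmer individually.
```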

4. Dependencies on Other Framework Components

  • Zend_Filter
  • Zend_Loader_PluginLoader
  • Zend_Exception

5. Theory of Operation

The main filter, Zend_Filter_PorterStemmer, loads the localized stemmer through a configurable plugin loader.
The built-in localized stemmers are located in Zend_Filter_PorterStemmer_* and can also be used directly.
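
The lookup described above might be sketched as follows. This is a minimal, self-contained illustration and not the proposed implementation: it replaces Zend_Loader_PluginLoader with a plain class-name lookup, all Sketch_ names and the setLocale() method are hypothetical, and the fallback from a <language>_<region> class to a <language> class is an assumption derived from the class naming scheme. The English stemmer applies only step 1a of the Porter algorithm (plural suffixes).

```php
<?php
// Hypothetical stand-in for Zend_Filter_Interface, so the sketch runs without the framework.
interface Sketch_Filter_Interface
{
    public function filter($value);
}

// Hypothetical English stemmer: implements only step 1a of the Porter algorithm.
class Sketch_PorterStemmer_En implements Sketch_Filter_Interface
{
    public function filter($value)
    {
        $word = (string) $value;
        if (substr($word, -4) === 'sses') {
            return substr($word, 0, -2);   // sses -> ss
        }
        if (substr($word, -3) === 'ies') {
            return substr($word, 0, -2);   // ies -> i
        }
        if (substr($word, -2) === 'ss') {
            return $word;                  // ss -> ss (unchanged)
        }
        if (substr($word, -1) === 's') {
            return substr($word, 0, -1);   // s -> (removed)
        }
        return $word;
    }
}

// Hypothetical main filter: picks the localized stemmer by locale and delegates to it.
class Sketch_PorterStemmer implements Sketch_Filter_Interface
{
    protected $_locale;

    public function __construct($locale = 'en')
    {
        $this->setLocale($locale);
    }

    public function setLocale($locale)
    {
        $this->_locale = (string) $locale;
        return $this;
    }

    // Assumed lookup order: "en_US" tries ..._En_Us first, then falls back to ..._En.
    public function getStemmer()
    {
        $parts      = explode('_', $this->_locale);
        $language   = ucfirst(strtolower($parts[0]));
        $candidates = array();
        if (isset($parts[1])) {
            $candidates[] = 'Sketch_PorterStemmer_' . $language . '_' . ucfirst(strtolower($parts[1]));
        }
        $candidates[] = 'Sketch_PorterStemmer_' . $language;

        foreach ($candidates as $class) {
            if (class_exists($class)) {
                return new $class();
            }
        }
        throw new Exception("No stemmer available for locale '{$this->_locale}'");
    }

    public function filter($value)
    {
        return $this->getStemmer()->filter($value);
    }
}

$stemmer = new Sketch_PorterStemmer('en_US');
echo $stemmer->filter('caresses'), "\n"; // caress
echo $stemmer->filter('ponies'), "\n";   // poni
```

In this sketch a localized stemmer such as Sketch_PorterStemmer_En can also be instantiated and used on its own, which mirrors the direct use of the built-in stemmers mentioned above.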

6. Milestones / Tasks

  • Milestone 1: [DONE] Finish proposal
  • Milestone 2: Working prototype
  • Milestone 3: Prototype checked into the incubator
  • Milestone 4: Unit tests are finished and the component is working
  • Milestone 5: Initial documentation exists
  • Milestone 6: Changed related components
  • Milestone 7: Moved to core.

7. Class Index

  • Zend_Filter_PorterStemmer
  • Zend_Filter_PorterStemmer_<language>
  • Zend_Filter_PorterStemmer_<language>_<region>

8. Use Cases

UC-01
UC-02

9. Class Skeletons

  1. Mar 28, 2010

    I actually see two problems:

    *) The name of the filter, "PorterStemmer", does not reflect its usage. Had I not read the wiki article, I would not know what this filter is intended for, and I think the same goes for most others.

    *) I see a problem in getting the filter definitions for languages. You say this filter is "localized", but if we can provide filters for just 5 languages, then it's not really localized. Where would you get the filters from to be able to cover about 40 of the 140 locales?

    1. Mar 28, 2010

      Hi Thomas

      • I dithered a little over the name. My reasons for PorterStemmer were the following:
        • "Stemming" is the general term for what this filter should do
        • The filter name should be a noun (Stemmer)
        • "Stemmer" alone could clash with other stemmers based on different algorithms
        • I'm not sure whether the filter should be located under Zend_Filter_Word, because it only handles single words as its argument.
      • Yes, you are right: we can't implement the algorithm for every locale, and for some languages such as zh the algorithm isn't applicable. But stemming is in fact different for every language, and I think it could sometimes be useful to distinguish by the complete locale.
        The ideal goal would be to implement the algorithm for every language where stemming is applicable.
        -> The Snowball link shows 14 languages
        -> Apache Lucene has implementations of this algorithm for additional languages