3. Component Requirements, Constraints, and Acceptance Criteria
This component will reduce inflected (or sometimes derived) words to their stem.
This component will not handle different encodings (all input must be UTF-8).
This component will not split words from texts.
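To make the first requirement concrete, here is a minimal sketch of a single rule set, step 1a of Porter's original English algorithm, which strips plural suffixes. It is written in Python for brevity, not in the component's PHP, and the function name porter_step_1a is hypothetical and not part of the proposed API. A full stemmer applies several further steps after this one.

```python
def porter_step_1a(word: str) -> str:
    """Apply step 1a of Porter's English algorithm to one lowercase word.

    Rules, tried in order (longest suffix first):
    SSES -> SS, IES -> I, SS -> SS, S -> (nothing).
    """
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


# The stemmer takes single words only; splitting text is the caller's job.
stems = [porter_step_1a(w) for w in "ponies caresses cats".split()]
```

In line with the constraints above, the caller splits the text and the function only ever sees one word at a time. The sketch assumes the word is already decoded, lowercase text.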
4. Dependencies on Other Framework Components
Zend_Filter
Zend_Loader_PluginLoader
Zend_Exception
5. Theory of Operation
The main filter Zend_Filter_PorterStemmer loads the localized stemmer with a configurable plugin loader.
The built-in localized stemmers are located in Zend_Filter_PorterStemmer_* and can be used directly.
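The dispatch described above can be sketched as follows. This is an illustration in Python, not the proposed PHP implementation: StemmerLoader, StemFilter and toy_english_stemmer are hypothetical names, the loader only stands in for the role Zend_Loader_PluginLoader would play, and the English rule is a placeholder, not the Porter algorithm.

```python
from typing import Callable, Dict

Stemmer = Callable[[str], str]


class StemmerLoader:
    """Hypothetical stand-in for a plugin loader: maps language codes to stemmers."""

    def __init__(self) -> None:
        self._stemmers: Dict[str, Stemmer] = {}

    def register(self, language: str, stemmer: Stemmer) -> None:
        self._stemmers[language.lower()] = stemmer

    def load(self, language: str) -> Stemmer:
        try:
            return self._stemmers[language.lower()]
        except KeyError:
            raise LookupError(f"no stemmer registered for {language!r}") from None


class StemFilter:
    """Main filter: resolves the localized stemmer once, then delegates to it."""

    def __init__(self, language: str, loader: StemmerLoader) -> None:
        self._stem = loader.load(language)

    def filter(self, word: str) -> str:
        return self._stem(word)


def toy_english_stemmer(word: str) -> str:
    """Placeholder rule (NOT Porter): strip one trailing 's', but keep 'ss'."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


loader = StemmerLoader()
loader.register("en", toy_english_stemmer)

via_filter = StemFilter("en", loader).filter("cats")  # through the main filter
direct = toy_english_stemmer("cats")                  # localized stemmer used directly
```

Because the loader is passed in, callers can register their own stemmers for further languages, and a language without a registered stemmer fails at construction time with an explicit error.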
6. Milestones / Tasks
Milestone 1: [DONE] Finish proposal
Milestone 2: Working prototype
Milestone 3: Prototype checked into the incubator
Milestone 4: Unit tests are finished and the component is working
2 Comments
Mar 28, 2010
Thomas Weidner
I see two problems:
*) The name of the filter, "PorterStemmer", does not reflect its usage. Had I not read the wiki article, I would not know what this filter is intended for, and I think the same goes for most others.
*) I see a problem in getting the filter definitions for languages. You say this filter is "localized", but if we can provide filters for only 5 languages, it is not really localized. Where would you get the filters needed to cover about 40 of the 140 locales?
Mar 28, 2010
Marc Bennewitz (private)
Hi Thomas
I dithered a little over the name. My reasons for "PorterStemmer" were the following:
"Stemming" is the general term for what this filter does
The filter name should be a noun (Stemmer)
"Stemmer" alone could clash with other stemmers based on different algorithms
I'm not sure whether the filter should be located in Zend_Filter_Word, because it only handles single words as its argument.
Yes, you are right: we can't implement the algorithm for every locale, and for some languages, such as zh, the algorithm isn't applicable. But stemming is in fact different for every language, and I think it could sometimes be useful to distinguish by the complete locale.
The ideal goal would be to implement the algorithm for every language where stemming is applicable.
-> The Snowball link shows 14 languages
-> Apache Lucene has implementations of this algorithm for additional languages