
<ac:macro ac:name="unmigrated-inline-wiki-markup"><ac:plain-text-body><![CDATA[

<ac:macro ac:name="unmigrated-inline-wiki-markup"><ac:plain-text-body><![CDATA[

Zend Framework: Zend_Feed_Reader Component Proposal

Proposed Component Name: Zend_Feed_Reader
Developer Notes: http://framework.zend.com/wiki/display/ZFDEV/Zend_Feed_Reader
Proposers: Pádraic Brady, Jurriën Stutterheim, Alexander Veremyev (Zend Liaison)
Revision: 1.0.0 - 31 July 2008 (wiki revision: 11)

1. Overview

Zend_Feed was originally created to offer a natural API akin to SimpleXML to the data contained in RSS and Atom feeds. The primary mechanism driving this is an abstract API to a DOMDocument representation of the XML feed.

With a few exceptions, Zend_Feed does not attempt to interpret RSS or Atom. It does not understand the formats, has no knowledge of the various RSS and Atom versions, makes no attempt to use best-practice HTTP retrieval methods, does not validate or filter typed return values (again with some exceptions), and does not currently support an internal cache for infrequently modified feeds.

Beyond these missing features, Zend_Feed also presents an inconsistent API: method calls, property accesses, and return types are unpredictable enough to undermine the intended natural API and lead to a lot of trial-and-error programming. The simplest examples are easy to spot. If you take an entry from an RSS feed (as $entry) that contains <content:encoded>, <dc:creator>, and <slash:comments> elements, the value of each (i.e. a typed result, not a DOMNode object) is retrieved as follows:

  • $entry->content(); (method name based on XML namespace prefix of element)
  • $entry->creator(); (method name based on tagname omitting XML namespace prefix)
  • $entry->{'slash:comments'}; (property name based on dynamic namespaced tagname)

Not only is this inconsistent, but the three forms are not even equivalent. Attempting the third option on <dc:creator>, for example, returns an object instead of the expected typed value. Taken in isolation, each form might appear to be a valid interpretation of RSS, but it comes at the cost of breaking the otherwise valid API, and it only works when such elements are actually present (of which there is absolutely no guarantee).

Look deeper and another subtle problem appears: Zend_Feed is incapable of correctly handling XML namespaces automatically. If two feeds use identical XML namespaces but different namespace prefixes (entirely legal XML behaviour), you would need to manually register and unregister the changing prefixes for each feed in order to have consistent parsing. This is particularly highlighted by RSS 1.0 feeds (which are RDF), where Zend_Feed pre-registers an invalid RSS namespace (there is no RSS XML namespace except for RDF feeds), making parsing all but impossible.
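
To illustrate the prefix problem, here is a minimal sketch using only PHP's DOM extension. It assumes two made-up RSS fragments that bind the Dublin Core namespace to different prefixes. The helper name firstCreator() and the private prefix dc11 are hypothetical and not part of Zend_Feed or this proposal. The point is that an XPath query matches on the namespace URI, so the prefix chosen by the feed author stops mattering.

```php
<?php
// Two RSS 2.0 fragments binding the same Dublin Core namespace URI
// to different prefixes ("dc" versus "dublin"). Both are legal XML.
$feedA = '<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">'
       . '<channel><item><dc:creator>Alice</dc:creator></item></channel></rss>';
$feedB = '<rss version="2.0" xmlns:dublin="http://purl.org/dc/elements/1.1/">'
       . '<channel><item><dublin:creator>Bob</dublin:creator></item></channel></rss>';

/**
 * Hypothetical helper: read the first item's creator regardless of the
 * prefix the feed author chose.
 */
function firstCreator($xml)
{
    $dom = new DOMDocument();
    $dom->loadXML($xml);
    $xpath = new DOMXPath($dom);
    // The prefix registered here is private to our queries. Matching is
    // done on the namespace URI, not on the prefix used in the document.
    $xpath->registerNamespace('dc11', 'http://purl.org/dc/elements/1.1/');
    return $xpath->evaluate('string(/rss/channel/item[1]/dc11:creator)');
}

$creatorA = firstCreator($feedA);
$creatorB = firstCreator($feedB);
```

Both calls use the same query, with no per-feed registration of prefixes.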

All of the above contribute to the one job any programmer wishes to avoid: writing additional support code. With Zend_Feed, every programmer quickly needs a good deal of extra work before it becomes truly useful on anything other than a few feeds whose condition is well known. It also means Zend_Feed will remain incapable of parsing some more exotic or older feeds which require specific knowledge.

So what is Zend_Feed_Reader?

After working with Zend_Feed for a while, I've built up some helper classes across various projects to make it more useful. In one current project (something akin to the Planetarium open source project) I've decided to refactor the existing source code and formalise it as a single component or, at the discretion of the current Zend_Feed maintainers, integrate it as a general refactoring and expansion of the existing Zend_Feed classes.

Zend_Feed_Reader would be an interpretive and possibly sanitising layer for Zend_Feed which is capable of understanding all RSS and Atom versions and condensing the data across all versions and extensions of both formats into a single Accessor API. I call it interpretive because it understands RSS and Atom sufficiently to distinguish between versions and to accurately condense the data by querying for preferred data points. Sanitising refers to the possible inclusion of a validation layer (RSS/Atom data is common enough that it could be made internal using a filter chain) since RSS/Atom is, of course, userland data we must validate before use, though this is a lesser priority in this proposal and likely not a first-version feature.

To offer a simple example, assume you are aggregating two RSS feeds. The first is RSS 2.0 with a <content:encoded> element, and the second is an RSS 1.0 feed (RDF) with only an RSS namespaced <description> element. Using Zend_Feed at present you would need to search for any potential content elements one by one, and then decide which one is preferred. With Zend_Feed_Reader, which interprets all feeds and selects a preferred one internally, you could simply call Zend_Feed_Reader_Entry_Rss::getContent(), which would return the text from <content:encoded> for the RSS 2.0 feed and from <description> for the RSS 1.0 feed without a content element. No fuss; no custom interpretive code. To boot, Zend_Feed_Reader can (if you need it - not unusual!) tell you what version of RSS is being interpreted since it's version aware.
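
The selection logic in that example could be sketched roughly as follows. This is only an illustration built on plain DOMXPath, not the component's actual code. The function names entryContent() and firstItem(), the constant names, and the exact order of preference are my assumptions.

```php
<?php
// Namespace URIs for the content module and for RSS 1.0 itself.
const NS_CONTENT = 'http://purl.org/rss/1.0/modules/content/';
const NS_RSS_10  = 'http://purl.org/rss/1.0/';

/** Hypothetical accessor: return the preferred content of one item. */
function entryContent(DOMElement $item)
{
    $xpath = new DOMXPath($item->ownerDocument);
    $xpath->registerNamespace('content', NS_CONTENT);
    $xpath->registerNamespace('rss', NS_RSS_10);

    // Candidates in order of preference: full content first, then the
    // RSS 1.0 (namespaced) description, then the RSS 2.0 (plain) one.
    $queries = array(
        'string(content:encoded)',
        'string(rss:description)',
        'string(description)',
    );
    foreach ($queries as $query) {
        $value = $xpath->evaluate($query, $item);
        if ($value !== '') {
            return $value;
        }
    }
    return null;
}

/** Hypothetical helper: parse a feed and return its first item. */
function firstItem($xml)
{
    $dom = new DOMDocument();
    $dom->loadXML($xml);
    return $dom->getElementsByTagName('item')->item(0);
}

$rss20 = '<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">'
       . '<channel><item><description>Short summary</description>'
       . '<content:encoded>&lt;p&gt;Full text&lt;/p&gt;</content:encoded>'
       . '</item></channel></rss>';
$rss10 = '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
       . ' xmlns="http://purl.org/rss/1.0/">'
       . '<item rdf:about="http://example.com/1">'
       . '<description>RDF description</description></item></rdf:RDF>';

$content20 = entryContent(firstItem($rss20));
$content10 = entryContent(firstItem($rss10));
```

The caller never inspects the feed version; the ordered query list carries the interpretive knowledge.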

That is probably the major benefit: one single, predictable accessor API.

Possible Features In Brief:

  • Understand and interpret all versions of RSS and Atom
  • Automatically condense RSS/Atom data into a single preferred Accessor API
  • Automatically apply sanitisation on input (with the exception of HTML)
  • Allow opt-out use of a predetermined filter/validation chain
  • Allow optional setting of custom filters and validators
  • Support HTTP Conditional GET (see "Zend_Feed and HTTP Conditional GET")
  • Support optional internal caching of all feeds until updated at source
  • Support current Zend_Feed querying via proxy
  • Expose the underlying parent DOMDocument object more clearly

Please note Zend_Feed_Reader, by definition, makes no changes to any aspect of creating feeds with Zend_Feed.

2. References

3. Component Requirements, Constraints, and Acceptance Criteria

  • MUST understand and interpret all versions of RSS and Atom
  • MUST automatically condense RSS/Atom data into a single preferred Accessor API
  • MIGHT automatically apply sanitisation on input (with the exception of HTML)
  • MIGHT allow opt-out use of a predetermined filter/validation chain
  • MIGHT allow optional setting of custom filters and validators
  • MUST support HTTP Conditional GET (see "Zend_Feed and HTTP Conditional GET")
  • MUST support optional internal caching of all feeds until updated at source
  • MUST support current Zend_Feed querying via proxy
  • MUST expose the underlying parent DOMDocument object more clearly

4. Dependencies on Other Framework Components

Zend_Http_Client
Zend_Feed
Zend_Feed_Abstract
Zend_Cache

5. Theory of Operation

Operation can follow one of two strategies. The first is to reuse the existing Zend_Feed API. While this would promote reuse, it doesn't solve the problem of the inconsistent API and would instead become an abstract proxy to the original implementation. The second is to bypass the current API, while maintaining access to it by proxy, and to query the underlying DOMDocument representation of a feed directly (e.g. using XPath). At present this proposal favours the second option since it reduces dependencies and allows greater flexibility by using DOMDocument directly to drive XPath queries (perhaps at a performance cost; nothing is free).

Operation would be quite simple. Zend_Feed_Reader would parent both Feed and Entry subclasses for RSS and Atom which contain sufficient logic to filter data from the underlying feeds. A user seeking the content from a batch of RSS and Atom feeds with unknown versions would only need one method: getContent(). Interpreting all feeds across all versions and locating the closest element match for an entry's content will be handled internally by Zend_Feed_Reader which can select a relevant "content" style element from the current feed when faced with many alternatives. Accessor methods would be available at both the Feed and Entry levels as appropriate. All methods would return literal values i.e. other than sanitisation no attempt would be made to manipulate values.
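
As a rough sketch of that arrangement, the classes below put the shared query logic in a parent Entry class and leave each format to state only its preferred elements. All class and function names here (SketchEntry, SketchEntryRss, SketchEntryAtom, sketchEntries) are hypothetical stand-ins, and Atom 1.0 is the only Atom version handled; the real class layout is still to be settled.

```php
<?php
abstract class SketchEntry
{
    protected $xpath;
    protected $node;

    public function __construct(DOMElement $node)
    {
        $this->node  = $node;
        $this->xpath = new DOMXPath($node->ownerDocument);
        $this->xpath->registerNamespace('atom', 'http://www.w3.org/2005/Atom');
        $this->xpath->registerNamespace('content', 'http://purl.org/rss/1.0/modules/content/');
    }

    /** Return the first non-empty string result, or null. */
    protected function firstMatch(array $queries)
    {
        foreach ($queries as $query) {
            $value = $this->xpath->evaluate("string($query)", $this->node);
            if ($value !== '') {
                return $value;
            }
        }
        return null;
    }

    abstract public function getContent();
}

class SketchEntryRss extends SketchEntry
{
    public function getContent()
    {
        return $this->firstMatch(array('content:encoded', 'description'));
    }
}

class SketchEntryAtom extends SketchEntry
{
    public function getContent()
    {
        return $this->firstMatch(array('atom:content', 'atom:summary'));
    }
}

/** Pick the entry class from the root element and wrap every entry. */
function sketchEntries($xml)
{
    $dom = new DOMDocument();
    $dom->loadXML($xml);
    $isAtom  = $dom->documentElement->localName === 'feed';
    $entries = array();
    foreach ($dom->getElementsByTagName($isAtom ? 'entry' : 'item') as $node) {
        $entries[] = $isAtom ? new SketchEntryAtom($node) : new SketchEntryRss($node);
    }
    return $entries;
}

$rssXml  = '<rss version="2.0"><channel><item>'
         . '<description>Hello from RSS</description></item></channel></rss>';
$atomXml = '<feed xmlns="http://www.w3.org/2005/Atom"><entry>'
         . '<summary>Hello from Atom</summary></entry></feed>';

// One method, whatever the format of the feed.
$contents = array();
foreach (array($rssXml, $atomXml) as $xml) {
    foreach (sketchEntries($xml) as $entry) {
        $contents[] = $entry->getContent();
    }
}
```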

In addition to the unified API, Zend_Feed_Reader could be supplemented with plugin APIs for common extensions, for example GeoRSS or Yahoo! Weather. These additional APIs could be derived from plugin calls: "$entry->georss()->getLatitude();".
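
One way such plugin calls might be wired up is through PHP's __call() magic method, as in the sketch below. The class names SketchGeoRssExtension and SketchPluggableEntry and the plugin map are hypothetical. Only the GeoRSS Simple <georss:point> element ("latitude longitude") is handled.

```php
<?php
class SketchGeoRssExtension
{
    private $xpath;
    private $node;

    public function __construct(DOMElement $node)
    {
        $this->node  = $node;
        $this->xpath = new DOMXPath($node->ownerDocument);
        $this->xpath->registerNamespace('georss', 'http://www.georss.org/georss');
    }

    /** Split "latitude longitude" into its two parts, or return null. */
    private function point()
    {
        $text = trim($this->xpath->evaluate('string(georss:point)', $this->node));
        return $text === '' ? null : preg_split('/\s+/', $text);
    }

    public function getLatitude()
    {
        $point = $this->point();
        return $point === null ? null : (float) $point[0];
    }

    public function getLongitude()
    {
        $point = $this->point();
        return $point === null ? null : (float) $point[1];
    }
}

class SketchPluggableEntry
{
    private $node;
    // Maps a plugin call name to the class implementing it.
    private $plugins = array('georss' => 'SketchGeoRssExtension');
    private $loaded  = array();

    public function __construct(DOMElement $node)
    {
        $this->node = $node;
    }

    /** Resolve $entry->georss() and similar calls to an extension object. */
    public function __call($name, $arguments)
    {
        if (!isset($this->plugins[$name])) {
            throw new BadMethodCallException("No such extension: $name");
        }
        if (!isset($this->loaded[$name])) {
            $class = $this->plugins[$name];
            $this->loaded[$name] = new $class($this->node);
        }
        return $this->loaded[$name];
    }
}

$xml = '<rss version="2.0" xmlns:georss="http://www.georss.org/georss">'
     . '<channel><item><title>Quake</title>'
     . '<georss:point>45.256 -71.92</georss:point></item></channel></rss>';
$dom = new DOMDocument();
$dom->loadXML($xml);
$entry = new SketchPluggableEntry($dom->getElementsByTagName('item')->item(0));

$latitude  = $entry->georss()->getLatitude();
$longitude = $entry->georss()->getLongitude();
```

Extensions are created lazily on first use, so feeds without the extension cost nothing extra.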

If implemented as a separate component, Zend_Feed_Reader would include a local Zend_Feed_Abstract instance through composition and would encapsulate access for all features intended to be internal (e.g. automated validation, caching and conditional fetches). Many of the above will require configuration options (either as an Array or Zend_Config instance).

Please refer to future use cases for additional operational examples.

6. Milestones / Tasks

  • Milestone 1: Select a strategy of Zend_Feed_Reader or refactoring Zend_Feed
  • Milestone 2: Complete initial component with unit tests
  • Milestone 3: Assemble a collection of common edge cases for resolution
  • Milestone 4: Verify that code operates under PHP 5.3
  • Milestone 5: Complete documentation

7. Class Index

  • Zend_Feed_Reader
  • Zend_Feed_Reader_Feed
  • Zend_Feed_Reader_Feed_Rss
  • Zend_Feed_Reader_Feed_Atom
  • Zend_Feed_Reader_Entry
  • Zend_Feed_Reader_Entry_Rss
  • Zend_Feed_Reader_Entry_Atom

Others may be determined during development

8. Use Cases

Use cases are currently being drafted, but here are a few samples. As usual, please refer to my public Zend Framework Proposals repository on Subversion, which contains unit tests (as good as, if not better than, use cases).

UC-01

Detect the type and version of a feed (represented by a string of form TYPE-VERSION). The strings are also contained in string constants like Zend_Feed_Reader::TYPE_ATOM_03. Maybe I should split this across two additional methods for ease of use by end users.
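
The original sample for this use case is not reproduced here. As a stand-in, the following is a minimal sketch of how detection might work by inspecting the root element and its namespace. The function sketchDetectType() and the exact strings it returns ('rss-20', 'atom-03' and so on) are assumptions modelled on the TYPE-VERSION form described above. Only RSS 0.9x/2.0, RSS 1.0, Atom 0.3 and Atom 1.0 are recognised.

```php
<?php
/** Hypothetical detector returning a TYPE-VERSION string. */
function sketchDetectType($xml)
{
    $dom = new DOMDocument();
    $dom->loadXML($xml);
    $root = $dom->documentElement;

    // RSS 0.9x and 2.0 declare their version on the <rss> root element.
    if ($root->localName === 'rss' && $root->getAttribute('version') !== '') {
        return 'rss-' . str_replace('.', '', $root->getAttribute('version'));
    }

    // RSS 1.0 is RDF: the root is rdf:RDF and the channel lives in the
    // RSS 1.0 namespace.
    if ($root->localName === 'RDF'
        && $root->namespaceURI === 'http://www.w3.org/1999/02/22-rdf-syntax-ns#') {
        $xpath = new DOMXPath($dom);
        $xpath->registerNamespace('rss', 'http://purl.org/rss/1.0/');
        if ($xpath->evaluate('count(/*/rss:channel)') > 0) {
            return 'rss-10';
        }
    }

    // Atom versions are told apart by the namespace of the <feed> root.
    if ($root->localName === 'feed') {
        if ($root->namespaceURI === 'http://www.w3.org/2005/Atom') {
            return 'atom-10';
        }
        if ($root->namespaceURI === 'http://purl.org/atom/ns#') {
            return 'atom-03';
        }
    }
    return 'unknown';
}

$typeRss20 = sketchDetectType('<rss version="2.0"><channel/></rss>');
$typeRss10 = sketchDetectType(
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
    . ' xmlns="http://purl.org/rss/1.0/">'
    . '<channel rdf:about="http://example.com/"><title>t</title></channel></rdf:RDF>'
);
$typeAtom10 = sketchDetectType('<feed xmlns="http://www.w3.org/2005/Atom"><title>t</title></feed>');
$typeAtom03 = sketchDetectType('<feed version="0.3" xmlns="http://purl.org/atom/ns#"><title>t</title></feed>');
```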

UC-02

Get some feed information

UC-03

Get some entry information

UC-04

Access underlying Zend_Feed API (RSS)

UC-05

Enable cache (conditional fetches also automatically enabled). Set a sensible TTL - Zend_Feed_Reader will clear the cache of old content when any feed is updated.
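
The original sample for this use case is not reproduced here either. The sketch below only illustrates the conditional fetch idea in isolation: validators from the last response are sent back, and a 304 Not Modified answer means the cached copy is reused. The function names and the cache record layout are hypothetical. The HTTP transport (Zend_Http_Client in the real component) and the cache backend (Zend_Cache) are deliberately left out, and the responses are simulated.

```php
<?php
/**
 * Build conditional request headers from a previously cached response.
 * $cached is null on the first fetch, otherwise an array with the keys
 * 'etag', 'lastModified' and 'body' (a hypothetical cache record layout).
 */
function sketchConditionalHeaders($cached)
{
    $headers = array();
    if (!empty($cached['etag'])) {
        $headers['If-None-Match'] = $cached['etag'];
    }
    if (!empty($cached['lastModified'])) {
        $headers['If-Modified-Since'] = $cached['lastModified'];
    }
    return $headers;
}

/**
 * Decide what to keep after a response arrives: on 304 Not Modified the
 * cached record is reused, otherwise a fresh record replaces it.
 * Header names are assumed to be normalised to the casing used below.
 */
function sketchResolveResponse($status, array $responseHeaders, $body, $cached)
{
    if ($status === 304 && $cached !== null) {
        return $cached;
    }
    return array(
        'etag'         => isset($responseHeaders['ETag']) ? $responseHeaders['ETag'] : null,
        'lastModified' => isset($responseHeaders['Last-Modified']) ? $responseHeaders['Last-Modified'] : null,
        'body'         => $body,
    );
}

// First fetch: nothing cached, so no conditional headers are sent.
$firstHeaders = sketchConditionalHeaders(null);
$record = sketchResolveResponse(200, array('ETag' => '"abc123"'), '<rss version="2.0"/>', null);

// Second fetch: the stored validator is sent back; the server answers 304.
$secondHeaders = sketchConditionalHeaders($record);
$record = sketchResolveResponse(304, array(), '', $record);
```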

9. Class Skeletons

Zend_Feed_Reader
Zend_Feed_Reader_Entry_Interface (Atom, Dc and RSS Entry classes implement this)
Zend_Feed_Reader_Feed_Interface (Atom, Dc and RSS Feed classes implement this)
Zend_Feed_Reader_Feed_Abstract

]]></ac:plain-text-body></ac:macro>

]]></ac:plain-text-body></ac:macro>

  1. Aug 01, 2008

    <p>As I said on IRC: when can I have it?! <ac:emoticon ac:name="laugh" /><br />
    The second option of bypassing the current API would seem the best way to go.<br />
    I'll do a checkout of your SVN right now and update it every once in a while ^^</p>

  2. Aug 05, 2008

    <p>So far the code is focused on the bypass option. Far be it from me to complicate matters <ac:emoticon ac:name="wink" />, the implementation is largely concerned with generating and assessing XPath queries on the original DOMDocument instance. While this adds a bit to the code length, it does correctly switch across all RSS and Atom versions and all implemented namespace modules in a simple fashion without any confusion. If for some reason XPath becomes a concern it would be a simple matter to use DOM traversal instead (and that would be merely an implementation concern as the unit tests would not alter a whit).</p>

    <p>The current unit tests are largely concerned with API direction/selection of nodes rather than post-processing (for example, reversing encoding/entities) but since such post-processing would require far fewer focused tests the current collection is probably the most important. They're also the most numerous. I'm guessing there could easily be ~800 discrete tests and possibly no less than ~700 once obvious duplicates are removed. Says a lot that you need that number of tests to execute all the possible branches interpreting RSS/Atom can take... Currently the RSS functionality has ~350 tests to date with lots more on the way.</p>

  3. Aug 15, 2008

    <p>Paste from IRC:</p>

    <p>Elazar: 1) Zend_Feed_Pipe - Programmatically emulates applicable functionality of Yahoo! Pipes in Zend Framework as a way to easily manipulate feeds. Provides functionality to consolidate feeds, filter, sort, slice, manipulate fields, etc. using Zend_Dom. Zend_Feed_Reader would definitely be useful here. I'm wondering if the overall concept behind Yahoo! Pipes could be extended beyond feeds to an extent that<br />
    [11:07pm] Elazar: Zend Framework could support it in other areas?<br />
    [11:07pm] Elazar: 2) Zend_Feed_Cache - Uses Zend_Cache for storage. To update the cache, iterates over a list of cached feeds and uses Zend_Http_Client to perform HEAD requests to check for updates to each and GET requests to actually retrieve updated feed content.</p>

  4. Aug 15, 2008

    fc

    <p>I love this proposal. Can't wait to see it in action.</p>

  5. Oct 29, 2008

    <ac:macro ac:name="note"><ac:parameter ac:name="title">Zend Comments</ac:parameter><ac:rich-text-body>
    <p>This proposal is approved for incubator development, provided that the following issues are addressed:</p>
    <ul>
    <li>Proposal should provide interfaces within Class Skeletons section (not only using provided reference to existing code).</li>
    <li>Corresponding objects should provide iterating interfaces.</li>
    <li>It would be good to think about Zend_Feed_Writer, to make it possible to replace the existing Zend_Feed implementation.</li>
    <li>Appropriate Zend_Feed methods should provide access to the Zend_Feed_Reader implementation (e.g. Zend_Feed::getReader(...)).</li>
    </ul>
    </ac:rich-text-body></ac:macro>

  6. Nov 17, 2008

    <p>Hi Alexander,</p>

    <p>Unfortunately I missed the acceptance by two weeks <ac:emoticon ac:name="smile" />. I think my email filter missed the notification or something. A few responding comments (Jurrien will chime in later no doubt - if you haven't already caught him over IRC):</p>

    <p>*Proposal should provide interfaces within Class Skeletons section (not only using provided reference to existing code).</p>

    <p>With respect, applying TDD is not conducive to a fixed interface at our present early stage (where we're in the middle of a refactoring process) but we are maturing steadily towards a standard interface which Jurrien has added, and which hopefully reflects what we'll see at the end of the day. Just take it with a tiny pinch of salt.</p>

    <p>*Corresponding objects should provide iterating interfaces.</p>

    <p>Could you expand? I think I understand what you mean but clarity makes Paddy's world revolve more steadily <ac:emoticon ac:name="smile" />.</p>

    <p>*That whould be good to think about Zend_Feed_Writer to have possibility to replace existing Zend_Feed implementation.</p>

    <p>I'll discuss the prospect with Jurrien, but for now the Reader alone is a huge undertaking. If we take that route we'll be sure to propose it separately. It would surely be simpler than Zend_Feed_Reader! <ac:emoticon ac:name="smile" /> Anything would!</p>

    <p>*Appropriated Zend_Feed methods should provide access to Zend_Feed_Reader implementation (e.g. Zend_Feed::getReader(...);</p>

    <p>This would require Zend_Feed alterations (not a current goal) but we will look into this and propose changes directly to you or whoever else is currently responsible for Zend_Feed. Part of the reason Reader is not a direct Zend_Feed refactoring was to isolate the changes outside of Zend_Feed itself to minimise the risk to backwards compatibility. This also means the inheritance direction puts Zend_Feed below the level where it would be aware of Reader - pushing that awareness further down the chain to Zend_Feed needs to be considered but should be possible if we maintain the current trend of keeping a real Zend_Feed object handy within Zend_Feed_Reader. I would not expect that awareness to extend to Zend_Feed manipulating Zend_Feed_Reader - it should be a one way street in the other direction as far as possible. Perhaps when ZF 2.0 is hitting the stage a fuller refactoring could be considered.</p>

    <p>The review is much appreciated, and thanks for allowing us to continue development within the Incubator! <ac:emoticon ac:name="smile" /></p>

    1. Nov 29, 2008

      <p>Hmm.. I must've missed the email about this comment. Better late than never to comment.</p>

      <p>Adding a getReader() method to Zend_Feed should be relatively simple. I guess we could have a look at that once the Zend_Feed_Reader component is done. If nothing else, it could just be a simple factory method. We should not depend on Zend_Feed too much though. Both Zend_Feed_Reader and a possible Zend_Feed_Writer should just use Zend_Feed for some very basic things. Ideally the current Zend_Feed implementation will be deprecated and removed in favor of Zend_Feed_(Reader|Writer) in ZF 2.0.</p>

      <p>As you might have guessed, I'm a proponent of Zend_Feed_Writer. As Paddy already wrote, we can have a look at such a component after Zend_Feed_Reader is done.</p>

      <p>Let ZF become famous for its excellent feed support! <ac:emoticon ac:name="wink" /></p>

  7. Mar 18, 2009

    <p>Guys, what is the status of this proposal? I'd love to see it finished.</p>

    1. Mar 18, 2009

      <p>Atom support and RSS (and DC?) support are pretty much finished (working and tested). Paddy was going to do some work on the edge-cases with mixed RSS/Atom/DC feeds, after which a good refactoring is in order. Paddy should probably take the lead in that, seeing as he's the most familiar with his plans. <ac:emoticon ac:name="wink" /></p>