Skip to end of metadata
Go to start of metadata
You are viewing an old version of this page. View the current version. Compare with Current  |   View Page History

<ac:macro ac:name="unmigrated-inline-wiki-markup"><ac:plain-text-body><![CDATA[

Zend Framework: Zend\Html\Filter Component Proposal

Proposed Component Name Zend\Html\Filter
Developer Notes http://framework.zend.com/wiki/display/ZFDEV/Zend\Html\Filter
Proposers Pádraic Brady
Zend Liaison TBD
Revision 0.8 - 9 August 2010: Initial Draft. (wiki revision: 2)

Table of Contents

1. Overview

Zend\Html\Filter is a generic DOM filter for performing HTML sanitisation and other HTML manipulations. While its focus is obviously HTML sanitisation, it uses a filter queue approach that is also suitable for dynamically altering HTML via the PHP DOM with reusable filter objects. This generic approach allows for other uses such as replacing, altering or adding elements, attributes and custom styling to the HTML being filtered.

This component is being proposed to Zend Framework 2.0 to ensure that Zend Framework developers have access to HTML sanitisation features that do not rely on external libraries or require mixing previously established classes of limited scope such as Zend_Filter_StripTags, an approach that is often fundamentally flawed and unnecessarily complex. While there is nothing wrong per se with using external libraries, Zend\Html\Filter was designed not simply as a handy Zend Framework component but as a full featured sanitiser which improves on existing external libraries' characteristics in both security and performance. The goal of Zend\Html\Filter is to become the best performing fully featured HTML sanitiser in PHP. It will be made available outside of the Zend Framework (for non-framework users) as the Wibble HTML Sanitiser.

Having a HTML sanitiser in the Zend Framework would fill a large gap in its overall security environment. At present applications often end up consuming third-party HTML from sources such as RSS and Atom feeds, emails, web scraping, web APIs, blog comments, WYSIWYG editors, etc. It has always been essential that developers sanitise such input to ensure it does not introduce Cross-Site Scripting (XSS) and Phishing vulnerabilities, as well as ensuring that the resulting output meets the requirements of an acceptable HTML standard and is free from obvious attempts to break normal page layout.

At present, only HTMLPurifier is capable of all these tasks and Zend Framework developers have required custom filters and classes in order to integrate it for Zend Framework applications. All other alternatives to HTMLPurifier have been found to suffer from insecure behaviour, missing sanitisation coverage and a lack of dependable HTML tidying. These findings are based on my own examination of HTMLPurifier and its alternatives: LINK. While this analysis may be biased in some respects (since I am a hardcore HTMLPurifier user), its factual findings are quite clear.

Zend\Html\Filter is intended to closely match HTMLPurifier's capabilities but at a significantly reduced cost in terms of performance. This is achieved by offloading complex HTML parsing and tidying to the PHP DOM and HTML Tidy extensions. It also improves performance by using DOM filters to strip, escape or prune HTML content considered dangerous. Performing this via DOM methods is often both faster and more reliable than doing so with regular expressions. This does not mean that Zend\Html\Filter will outperform common regular expression based HTML sanitisers in all scenarios. However, the more complex the task the more likely it is that regular expression processing will become a performance bottleneck, one that Zend\Html\Filter does not encounter.

2. References

3. Component Requirements, Constraints, and Acceptance Criteria

Most requirements take the form of "foo will do ...." or "foo will not support ...", although different words and sentence structure might be used. Adding functionality to your proposal is requirements creep (bad), unless listed below. Discuss major changes with your team first, and then open a "feature improvement" issue against this component.

  • This component will correctly reads a developers mind for intent and generate the right configuration file.
  • The generated config file will not support XML, but will provide an extension point in the API.
  • This component will use no more memory than twice the size of all data it contains.
  • This component will include a factory method.
  • This component will not allow subclassing. (i.e. when reviewed, we expect to see "final" keyword in code)
  • This component will only generate data exports strictly complying with RFC 12345.
  • This component will validate input data against formats supported by ZF component Foo.
  • This component will not save any data using Zend_Cache or the filesystem. All transient data will be saved using Zend_Session.

4. Dependencies on Other Framework Components

  • Zend_Exception

5. Theory of Operation

Zend\Html\Filter operates as a DOM filter queue with a HTML Tidy postprocessor. In effect, it follows the following process:

1. Converts all HTML input to UTF-8
2. Loads the HTML into a DOMDocument object
3. Applies one or more filters (DOM manipulators) to the HTML DOM
4. Extracts the filtered HTML from DOM and applies HTML Tidy (ext/tidy)
5. Converts the final HTML to the user's selected character encoding (if not UTF-8)

Many of these stages are self-explanatory, so the remainder of this section concern the filters.

The filters used by Zend\Html\Filter are basically classes which apply DOM operations (using the normal PHP 5 DOM API). This allows filters to add, remove or alter any part of the DOM. The reliance on DOM affords developers a common well understood API for manipulating HTML without resorting to regular expressions. Filters may be applied one after another as a series of filter operations (these are called explicitly and no internal queue is actually constructed).

By default, Zend\Html\Filter bundles filters named Strip, Escape, Prune and Cull for use in various settings for HTML sanitisation. The Strip filter is enabled by default and, surprise, strips out any HTML considered dangerous from an XSS or Phishing perspective. While Zend\Html\Filter includes a number of whitelists which may be used by these filters, the Strip filter assumes by default that all HTML elements must be stripped.

All of the bundled filters also make use of a set of utility methods dedicated to applying any whitelists and sanitisation to key parts of the HTML contained in a DOMDocument. Whitelisting prevents the occurance of elements or attributes which are either illegal or not safe for output. The sanitisation prevents the occurance of attribute values and CSS values which are likewise illegal or unsafe. This part of Zend\Html\Filter does utilise regular expressions though they are not subject to being fooled by malformed string input (as is the major risk with purely regular expression driven sanitisers). The regular expressions are "borrowed" from several open source HTML sanitisation and parsing libraries from outside PHP. This can be perceived as an odd decision, however it is foolhardy to reinvent the wheel for a handful of one line regular expressions already widely deployed and tested in multiple other languages outside of PHP. All sanitisation routines operate on a zero tolerance basis by removing failing attribute or CSS values instead of attempting to manipulate them into compliance.

Finally, once all sanitisation/filtering has been performed, the DOMDocument output is passed through ext/tidy. This postprocessing stage ensures that the output HTML is well formed and complies to a user selected HTML standard. While this stage is absolutely necessary to guarantee well formed output, it may be disabled via an option (an exception is thrown by default on systems not supporting ext/tidy). Zend\Html\Filter was specifically tested to ensure that disabling HTML Tidy did not impact on the efficacy of its sanitisation (many sanitisation related unit tests were performed under both conditions).

6. Milestones / Tasks

Describe some intermediate state of this component in terms of design notes, additional material added to this page, and / code. Note any significant dependencies here, such as, "Milestone #3 can not be completed until feature Foo has been added to ZF component XYZ." Milestones will be required for acceptance of future proposals. They are not hard, and many times you will only need to think of the first three below.

  • Milestone 1: design notes will be published here
  • Milestone 2: Working prototype checked into the incubator supporting use cases #1, #2, ...
  • Milestone 3: Working prototype checked into the incubator supporting use cases #3 and #4.
  • Milestone 4: Unit tests exist, work, and are checked into SVN.
  • Milestone 5: Initial documentation exists.

If a milestone is already done, begin the description with "[DONE]", like this:

  • Milestone #: [DONE] Unit tests ...

7. Class Index

  • Zend_Magic_Exception

8. Use Cases

UC-01

... (see good use cases book)

9. Class Skeletons

zone: Missing {zone-data:skeletons}
]]></ac:plain-text-body></ac:macro>

Labels:
None
Enter labels to add to this page:
Please wait 
Looking for a label? Just start typing.