This line was removed.
This word was removed. This word was added.
This line was added.
Changes (9)View Page History
0.9 - 10 August 2010: Initial Draft.
1.0 - 10 August 2010: Initial Version.
Having a HTML sanitiser in the Zend Framework would fill a large gap in its overall security environment. At present applications often end up consuming third-party HTML from sources such as RSS and Atom feeds, emails, web scraping, web APIs, blog comments, WYSIWYG editors, etc. It has always been essential that developers sanitise such input to ensure it does not introduce Cross-Site Scripting (XSS) and Phishing vulnerabilities, as well as ensuring that the resulting output meets the requirements of an acceptable HTML standard and is free from obvious attempts to break normal page layout.
At present, only HTMLPurifier is capable of all these tasks and Zend Framework developers have required custom filters and classes in order to integrate it for Zend Framework applications. All other alternatives to HTMLPurifier have been found to suffer from insecure behaviour, missing sanitisation coverage and a lack of dependable HTML tidying. These findings are based on my own examination of HTMLPurifier and its alternatives: LINK. [HTML Sanitisation: The Devil's In The Details (And The Vulnerabilities)|http://blog.astrumfutura.com/archives/431-HTML-Sanitisation-The-Devils-In-The-Details-And-The-Vulnerabilities.html]. While this analysis may be biased in some respects (since I am a hardcore HTMLPurifier user), its factual findings are quite clear. Most HTML sanitisers are at present not all that secure.
Zend\Html\Filter is intended to closely match HTMLPurifier's capabilities but at a significantly reduced cost in terms of performance. This is achieved by offloading complex HTML parsing and tidying to the PHP DOM and HTML Tidy extensions. It also improves performance by using DOM filters to strip, escape or prune HTML content considered dangerous. Performing this via DOM methods is often both faster and more reliable than doing so with regular expressions. This does not mean that Zend\Html\Filter will outperform common regular expression based HTML sanitisers in all scenarios. However, the more complex the task the more likely it is that regular expression processing will become a performance bottleneck, one that Zend\Html\Filter does not encounter, bringing their performance closer together. Rough benchmarks for the Wibble prototype that this proposal is based on can be found at: [HTML Sanitisation Benchmarking With Wibble (ZF Proposal)|http://blog.astrumfutura.com/archives/430-HTML-Sanitisation-Benchmarking-With-Wibble-ZF-Proposal.html].
Zend\Html\Filter operates as a DOM filter queue with a HTML Tidy postprocessor. In effect, it follows the following process:
1. Converts all HTML input to HTML safe UTF-8
2. Loads the HTML into a DOMDocument object
3. Applies one or more filters (DOM manipulators) to the HTML DOM
3. Applies one or more filters (DOM manipulators) to the HTML DOM
The filters used by Zend\Html\Filter are basically classes which apply DOM operations (using the normal PHP 5 DOM API). This allows filters to add, remove or alter any part of the DOM. The reliance on DOM affords developers a common well understood API for manipulating HTML without resorting to regular expressions. Filters may be applied one after another as a series of filter operations (these are called explicitly and no internal queue is actually constructed).
By default, Zend\Html\Filter bundles filters named Strip, Escape, Prune and Cull for use in various settings for HTML sanitisation. The Strip filter is enabled by default and, surprise, strips out any HTML considered dangerous from an XSS or Phishing perspective. While Zend\Html\Filter includes a number of (provisional - to be reviewed) whitelists which may be used by these filters, the Strip filter assumes by default that all HTML elements must be stripped.
All of the bundled filters also make use of a set of utility methods dedicated to applying any whitelists and sanitisation to key parts of the HTML contained in a DOMDocument. Whitelisting prevents the occurance of elements or attributes which are either illegal or not safe for output. The sanitisation prevents the occurance of attribute values and CSS values which are likewise illegal or unsafe. This part of Zend\Html\Filter does utilise regular expressions though they are not subject to being fooled by malformed string input (as is the major risk with purely regular expression driven sanitisers). Zend\Html\Filter not only normalises all input to UTF-8, it also then removes UTF-8 characters which are recommended to be avoided (for example, it would remove half-width characters which are interpreted by IE6 and capable of use as XSS vectors). The regular expressions are "borrowed" from several open source HTML sanitisation and parsing libraries from outside PHP. This can be perceived as an odd decision, however it is foolhardy to reinvent the wheel for a handful of one line regular expressions already widely deployed and tested in multiple other languages outside of PHP. All sanitisation routines operate on a zero tolerance basis by removing failing attribute or CSS values instead of attempting to manipulate them into compliance.
Finally, once all sanitisation/filtering has been performed, the DOMDocument output is passed through ext/tidy. This postprocessing stage ensures that the output HTML is well formed and complies to a user selected HTML standard. While this stage is absolutely necessary to guarantee well formed output, it may be disabled via an option (an exception is thrown by default on systems not supporting ext/tidy). Zend\Html\Filter was specifically tested to ensure that disabling HTML Tidy did not impact on the efficacy of its sanitisation (many sanitisation related unit tests were performed under both conditions) of XSS and Phishing vulnerabilities though it may allow other page breaking tactics (hence the need to explicitly choose to disable HTML Tidy).
Finally, it is important to note that Zend\Html\Filter cannot help but not support the entire scope of HTML 5. Neither libxml2 (the basis of PHP DOM) or HTML Tidy understand HTML 5 at the time of this proposal. There are HTML5 elements and other valid syntax which would be stripped from HTML 5 input. This does not mean that you can't filter a typical HTML 5 snippet and include it into HTML 5 - the limitations largely surround new HTML elements introduced with HTML 5 such as, for example, the section and time elements.