
<ac:macro ac:name="unmigrated-inline-wiki-markup"><ac:plain-text-body><![CDATA[

<ac:macro ac:name="unmigrated-inline-wiki-markup"><ac:plain-text-body><![CDATA[

Zend Framework: Zend_Filter_CharacterEntityEncode & Zend_Filter_CharacterEntityDecode Component Proposal

Proposed Component Name Zend_Filter_CharacterEntityEncode & Zend_Filter_CharacterEntityDecode
Developer Notes http://framework.zend.com/wiki/display/ZFDEV/Zend_Filter_CharacterEntityEncode & Zend_Filter_CharacterEntityDecode
Proposers Marc Bennewitz
Zend Liaison TBD
Revision 0.1 - 24. Oct 2009: Initial Draft.
1.0 - 03. December 2010: Archived (wiki revision: 18)

Table of Contents

1. Overview

The encoder is a simple, fully configurable filter that encodes characters to their corresponding entities.
The decoder is the opposite filter, decoding entities back to their characters.

Moved to GitHub
This proposal was moved to GitHub
-> http://github.com/marc-mabe/EntityCoder

2. References

3. Component Requirements, Constraints, and Acceptance Criteria

  • This component will provide a complete and highly configurable interface for encoding and decoding text containing entities.
  • This component will not replace Zend_Filter_HtmlEntities.
  • This component will use iconv to convert between character sets.
  • This component will not handle special CDATA content (such as the content of <script></script>).

4. Dependencies on Other Framework Components

  • Zend_Filter
  • Zend_Exception

5. Theory of Operation

The encoder converts the complete text from the input character set to UTF-8 and replaces only those characters that are not available in the given output character set, using either a named entity defined by the user or a numeric or hex entity. Afterwards the text is converted back to the output character set.

The decoder converts all entities (user-defined named entities, numeric and hex) to their equivalent characters in the given character sets. If an entity cannot be converted to the charset, the configured action is applied (exception, translit, ignore, entity, substitute). Furthermore, it is configurable whether the special chars (&,<,>,",') must be kept.
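
For illustration, a purely hypothetical usage sketch; the class names and option keys (inputCharSet, outputCharSet, entityStyle, invalidCharAction) are assumptions derived from this proposal text, not a confirmed API:

<ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
// Hypothetical encoder: convert UTF-8 input to ISO-8859-1 and replace every
// character that ISO-8859-1 cannot represent with a numeric entity.
$encoder = new Zend_Filter_EntityEncode(array(
    'inputCharSet'  => 'UTF-8',      // assumed option name
    'outputCharSet' => 'ISO-8859-1', // assumed option name
    'entityStyle'   => 'numeric',    // assumed: named | numeric | hex
));
echo $encoder->filter('Price: 10 €'); // e.g. "Price: 10 &#8364;"

// Hypothetical decoder: resolve entities and apply the configured action
// when a character cannot be represented in the output character set.
$decoder = new Zend_Filter_EntityDecode(array(
    'inputCharSet'      => 'ISO-8859-1',
    'outputCharSet'     => 'ISO-8859-1',
    'invalidCharAction' => 'translit', // exception | translit | ignore | entity | substitute
));
echo $decoder->filter('K&auml;se &amp; Brot'); // e.g. "Käse & Brot"
]]></ac:plain-text-body></ac:macro>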

6. Milestones / Tasks

  • Milestone 1: [DONE] Finish proposal
  • Milestone 2: [DONE] Working prototype
  • Milestone 3: Prototype checked into the incubator
  • Milestone 4: Unit tests exist, are finished, and the component is working
  • Milestone 5: Initial documentation exists
  • Milestone 6: Changed related components
  • Milestone 7: Moved to core.

7. Class Index

  • Zend_Filter_CharacterEntityEncode
  • Zend_Filter_CharacterEntityDecode
    or
  • Zend_Filter_EntityEncode
  • Zend_Filter_EntityDecode

8. Use Cases

UC-01 - Rich-Text-Editor

You get HTML formatted text from a rich text editor in ISO-8859-1 and need to convert it to UTF-8.
-> This example converts the text to UTF-8 and decodes all entities to UTF-8 except the special chars ",',<,>,& (see the sketch below).
-> You will get a valid HTML formatted string with a minimum of entities.
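
A rough approximation of this use case using only PHP built-ins (not the proposed filter), assuming $htmlFromEditor holds the submitted markup; it covers only the named entities of the standard table, which is exactly the limitation the proposed filter is meant to remove:

<ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
// Convert the editor output from ISO-8859-1 to UTF-8, then decode every
// *named* entity except the five special characters, so the markup stays valid.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $htmlFromEditor);

// "character => entity" table for UTF-8 (3rd parameter needs PHP >= 5.3.4),
// flipped to "entity => character", minus the special chars we want to keep.
$table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'UTF-8');
$map   = array_flip($table); // '&auml;' => 'ä', '&euro;' => '€', ...
unset($map['&amp;'], $map['&lt;'], $map['&gt;'], $map['&quot;'], $map['&#039;']);

// decode everything else in one pass; numeric/hex entities are NOT covered here
$minimized = strtr($utf8, $map);
]]></ac:plain-text-body></ac:macro>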

UC-02 - Convert HTML to plain
UC-03
UC-04

9. Class Skeletons


Labels: filter, entity, html, xml
  1. Mar 07, 2010

    <p>Looks nice. Great replacement for htmlspecialchars() and htmlentities(). Hope it gets into ZF soon.</p>

    1. Mar 07, 2010

      <p>Thanks for your interest.</p>

      <p>It's not a replacement for the HTML entities filter, it's an addition.<br />
      -> htmlentities() and friends perform better than this filter, but this one adds some functionality.</p>

  2. Jun 03, 2010

    <p>Regarding class naming, how about HtmlEntityDecode instead of CharacterEntityDecode? This would have better parity with the existing Zend_Filter_HtmlEntities.</p>

    <p>Are you planning to replace the existing HtmlEntities class? That would be a significant break in backwards compatibility. The alternative of having two html encoding filters seems like it would be too confusing. </p>

    <p>Should we drop the encoding class from the proposal?</p>

    1. Jun 03, 2010

      <blockquote>
      <p>Regarding class naming, how about HtmlEntityDecode instead of CharacterEntityDecode? This would have better parity with the existing Zend_Filter_HtmlEntities.</p></blockquote>
      <p>HTML would be wrong because this filter is designed to handle XML and HTML entities. My current implementation is simply EntityEncode & EntityDecode <br class="atl-forced-newline" /></p>
      <blockquote>
      <p>Are you planning to replace the existing HtmlEntities class? That would be a significant break in backwards compatibility. The alternative of having two html encoding filters seems like it would be too confusing.</p></blockquote>
      <p>No, because it simply uses the built-in function, which is much faster, and in most cases you don't need the extra functionality.</p>

      1. Jun 03, 2010

        <p>I see your point about HtmlEntityDecode being 'wrong' given the support for XML entities. However, since XHTML supports XML entities, I'm not sure the name is entirely incorrect.</p>

        <p>It seems to me that having two filters that handle (X)HTML entities is preferable to having three. E.g.</p>

        <p> HtmlEntities<br />
        HtmlEntityDecode</p>

        <p>vs.</p>

        <p> HtmlEntities<br />
        EntityEncode<br />
        EntityDecode</p>

        1. Jun 04, 2010

          <ul>
          <li>Filters using only a built-in function should be named after those functions.</li>
          <li>EntityEncode will add some functionality - why would you skip it?</li>
          <li>HtmlEntityDecode could make sense as a fast decoder simply using html_entity_decode.<br />
          Then we would have 4 filters:
          <ul>
          <li>HtmlEntities</li>
          <li>HtmlEntityDecode</li>
          <li>EntityEncode</li>
          <li>EntityDecode</li>
          </ul>
          </li>
          </ul>

          1. Jun 05, 2010

            <p>You wouldn't have to skip the EntityEncode functionality, you could add it to the existing HtmlEntities as additional options. Having four filters works too, just confusing for people, IMO.</p>

  3. Jun 03, 2010

    <p>I have a few questions regarding some of the inner-workings of the entity decoder:</p>

    <p>1. The proposal mentions conversion of numeric and hex entities. Will this be limited to the 252-253 standard entities, or will it support the full range of unicode codepoints?</p>

    <p>2. When converting named entities, could we add support for being case-insensitive for entities where the case does not matter (such as &NBSP;)?</p>

    <p>3. Can you use the get_html_translation_table() function instead of enumerating the HTML 4 entities? Or is it insufficient?</p>

    <p>4. I think your class constants violate the coding standards (words should be underscore separated). How about INVALID_CHAR_EXCEPTION instead of ONILLEGALCHAR_EXCEPTION?</p>

    <p>Cheers,<br />
    Stew</p>

    1. Jun 04, 2010

      <p>1. There isn't a limit -> the full range of Unicode.<br />
      2. Named entities are case-sensitive (&auml; <> &Auml;) and something like &AUML; doesn't exist, but you can define your own entity!<br />
      3. I defined the most needed entity references (HTML4 / XML / special) as a static array and you can overwrite these tables (see the sketch below).<br />
      4. You are right - I couldn't find a good name for it - INVALID_CHAR_* looks a bit nicer <ac:emoticon ac:name="wink" /><br />
      (or we define only EXCEPTION, TRANSLIT etc.)</p>
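
      <p>A purely hypothetical sketch of point 3: getEntityReference() appears later in this thread, but setEntityReference() and the exact table format are assumptions, not a confirmed API:</p>
      <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
      // Hypothetical: extend the built-in HTML4/XML reference table with a
      // user-defined entity before decoding.
      $decoder   = new Zend_Filter_EntityDecode(array('outputCharSet' => 'UTF-8'));
      $reference = $decoder->getEntityReference();   // accessor used later in this thread
      $reference['smiley'] = "\xE2\x98\xBA";         // U+263A (☺) as UTF-8 bytes
      $decoder->setEntityReference($reference);      // assumed accessor

      echo $decoder->filter('Hello &smiley;');       // assumed result: "Hello ☺"
      ]]></ac:plain-text-body></ac:macro>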

      1. Jun 05, 2010

        <p>That's great!</p>

        <p>We should have an option to make entities case-insensitive when the case doesn't matter. Yes, entities are case-sensitive, but often they are used incorrectly. Webkit and Gecko support this behavior.</p>

        <p>If an entity varies by case (e.g. aring vs Aring), then we are case-sensitive for that entity. If the entity does not vary by case (e.g. amp) then we should accept both upper and lower case variations (e.g. AMP).</p>

        1. Jun 30, 2010

          <p>Hi Marc, any objection to adding this loose-case matching feature to the proposal? It's an important feature for my use case.</p>

        2. Jul 01, 2010

          <blockquote><p>If an entity varies by case (e.g. aring vs Aring), then we are case-sensitive for that entity. If the entity does not vary by case (e.g. amp) then we should accept both upper and lower case variations (e.g. AMP).</p></blockquote>
          <p>Because you can define your own entities, it's not possible to know that &amp; could be case-insensitive.</p>

          <p>My idea would be to allow calling a callback for unknown entities like the following:</p>
          <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
          $decoder->setUnknownEntityCallback(function ($entity) {
              // your code to handle unknown entities
          });
          $decoder->filter('&&InvalidEntity;');
          ]]></ac:plain-text-body></ac:macro>
          <p>-> The callback is called for every unknown entity<br />
          -> The return value is the replacement for the unknown entity</p>

          <p>Then you can define your own actions.</p>

          1. Jul 01, 2010

            <blockquote><p>Because you can define your own entities, it's not possible to know that &amp; could be case-insensitive.</p></blockquote>

            <p>The way that I have implemented it, the table of entities is examined to determine which entities appear only once and which entities appear more than once with case variations. This approach should be compatible with user-defined entities.</p>

            <p>If the user defines a new entity 'foobar' and loose case matching is enabled, we will replace foobar or FOOBAR. If two variations of foobar appear, or if loose case matching is disabled, we will only replace foobar.</p>

            <blockquote><p>My idea would be to allow calling a callback for unknown entities like the following</p></blockquote>

            <p>A hook for the user to implement the feature is not as helpful as implementing the feature. Loose matching of entities is now a precedent that has been established by the popular web browsers (gecko, webkit, etc). It is a feature that I need and I suspect others will appreciate it also.</p>

            <p>I would argue that it should be on by default, but either way I feel strongly that it should be built-in.</p>

            1. Jul 02, 2010

              <blockquote><p>The way that I have implemented it, the table of entities is examined to determine which entities appear only once and which entities appear more than once with case variations. This approach should be compatible with user-defined entities.</p></blockquote>
              <p>This sounds slow!</p>

              <blockquote><p>A hook for the user to implement the feature is not as helpful as implementing the feature. Loose matching of entities is now a precedent that has been established by the popular web browsers (gecko, webkit, etc). It is a feature that I need and I suspect others will appreciate it also.</p></blockquote>
              <p>Sure, but browsers are not libraries <ac:emoticon ac:name="wink" /> If we implement such a feature (to detect unknown entities) the user should define the action he wants.</p>

              <p>With such a callback you can implement case-insensitive behavior very simply:</p>
              <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
              $map = $decoder->getEntityReference();
              $decoder->setUnknownEntityCallback(function ($entity) use ($map) {
                  $lower = strtolower($entity);
                  if (isset($map[$lower])) {
                      return $map[$lower];
                  } else {
                      // remove unknown entity
                      return '';
                  }
              });
              ]]></ac:plain-text-body></ac:macro>

              <blockquote><p>I would argue that it should be on by default, ...</p></blockquote>
              <p>NO! It's only an addition.</p>

              1. Jul 02, 2010

                <blockquote><p>This sounds slow!</p></blockquote>

                <p>It really shouldn't be. Especially relative to the expense of performing text translation. Recall, I am talking about examining the table of named entities which only contains around 250 entries. In any case, performance speculation is not a good reason to dismiss this feature.</p>

                <blockquote><p>Sure, but browsers are not libraries. If we implement such a feature (to detect unknown entities) the user should define the action he wants.</p></blockquote>

                <p>I see your point. The two features are in some conflict which raises the question of what the right thing to do is. If loose-case matching is enabled, we would be effectively saying that the entity is known and therefore we shouldn't call the unknown entity callback for that entity.</p>

                <blockquote><p>With such a callback you can implement case-insensitive behavior very simply:</p></blockquote>

                <p>The callback feature is very cool, but this function doesn't quite capture the behavior I am talking about. It is a bit more complicated than this. Recall that we only want to be case-insensitive for entities where the case does not matter. For example, we won't tolerate a case mismatch for 'aring', but we will for amp, nbsp, etc.</p>

                <blockquote><p>NO! It's only an addition.</p></blockquote>

                <p>Ok, but you agree that I can add it?</p>

                1. Jul 02, 2010

                  <p>Do you have some sample code?</p>

                  1. Jul 03, 2010

                    <p>Sure, here is how I am doing named entity decoding. You can see that we walk over the entity table and support both upper and lower case variations of entities that appear only once in the table.</p>

                    <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
                    $mapping  = array();
                    $entities = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES);
                    $entityCounts = array_count_values(array_map('strtolower', $entities));
                    foreach ($entities as $character => $entity) {

                        // translated character should be in requested charset
                        // using iconv for its better charset support.
                        $character = html_entity_decode($entity, ENT_QUOTES, self::UTF8);
                        $character = iconv(self::UTF8, $this->_charset, $character);

                        $mapping[$entity] = $character;

                        // some entities vary by case (e.g. &aring, &Aring); if this entity
                        // has only one entry, support both upper and lower-case variations.
                        if ($entityCounts[strtolower($entity)] == 1) {
                            $mapping[strtoupper($entity)] = $character;
                            $mapping[strtolower($entity)] = $character;
                        }
                    }

                    // do decoding of named entities.
                    $value = str_replace(array_keys($mapping), array_values($mapping), $value);
                    ]]></ac:plain-text-body></ac:macro>

                    1. Jul 04, 2010

                      <p>Now I tested the behavior with FF 3.6.6 and IE 8 and I saw that these browsers don't implement a fallback! -> Unknown entities are displayed as-is, as entities.</p>

                      <p>Therefore I don't know why it should be implemented.<br />
                      1. not a standard<br />
                      2. not a quasi-standard (behavior of most browsers)</p>

                      <p>And your behavior is quite special, too. For example, invalid entities like "&NbSp;" will not be detected.</p>

                      <p>My next point is: You can implement your own behavior using the callback functionality.</p>

                      <p>EDIT: FF 2.6.6 -> FF 3.6.6</p>

                      1. Jul 04, 2010

                        <p>We're seeing two different things. I just tested again under FF 3.5.10, Safari 5.0, Chrome 5.0 and IE 8. I used the following markup:</p>

                        <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
                        <html>
                        <body>
                        &amp; &AMP; &Amp;
                        </body>
                        </html>
                        ]]></ac:plain-text-body></ac:macro>

                        <p>All four browsers demonstrate the same behavior. The first two entities are honored, and the third entity is ignored (displayed as-is). This matches the decoding behavior of my example code. The 'special' behavior you are referring to is actually quite intentional.</p>

                        <p>This is most definitely a quasi-standard, and we should provide support for it.</p>

                        1. Jul 04, 2010

                          <p>hehe please use the following test:</p>
                          <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
                          <html>
                          <body>

                          &amp;<br />
                          &AMP;<br />
                          &Amp;<br />
                          <br />

                          &nbsp;<br />
                          &NBSP;<br />
                          &NbSp;<br />
                          <br />

                          &para;<br />
                          &PARA;<br />
                          &PaRa;<br />
                          <br />

                          &aring;<br />
                          &Aring;<br />
                          &ARING;<br />

                          </body>
                          </html>
                          ]]></ac:plain-text-body></ac:macro>

                          <p>You will note that only the amp entity will be detected case-insensitively. This is because users often don't escape this entity, or escape it incorrectly, and this would break links <ac:emoticon ac:name="wink" /></p>

                          1. Jul 05, 2010

                            <p>Interesting. Looks like the case-insensitivity only applies to a limited set of entities: copy, reg, quot, lt, gt, and amp. Not sure what the rationale behind that is, but clearly my implementation is incorrect. It would be nice to match this behavior, but I also discovered that browsers will decode entities that have no trailing semi-colon and I don't think it will be practical to match all of these kinds of peculiarities.</p>

                            1. Jul 06, 2010

                              <p>It sounds like an HTML auto-repair, like Tidy.</p>

                              <p>This should be done in another filter, and there is already work on it:
                              <a class="external-link" href="http://framework.zend.com/wiki/display/ZFPROP/Zend_Filter_Html+-+Thomas+Weidner">http://framework.zend.com/wiki/display/ZFPROP/Zend_Filter_Html+-+Thomas+Weidner</a></p>

                              <p>This doesn't mean that the behavior for unknown entities shouldn't be configurable, but an entity auto-repair shouldn't be part of this filter.</p>

  4. Jun 04, 2010

    <p>One thing to note:</p>

    <p>There is already a Zend_Filter_Encode and a Zend_Filter_Decode I am working on.<br />
    There are several adapters/extensions for them like Base64, Json, Punycode and Url.</p>

    <p>Additionally, I am currently working on a Zend_Filter_TwoWay (as requested) which enables two-way filtering.</p>

    <p>As your proposal enables encoding as well as decoding, the following would be a perfect fit in my eyes:<br />
    Zend_Filter_Encode_HtmlEntity (implies Zend_Filter_Decode_HtmlEntity)<br />
    Zend_Filter_Encode_Entity (implies Zend_Filter_Decode_Entity)</p>

    1. Jun 04, 2010

      <p>Thanks for the info!</p>

      <p>Yes, a dedicated namespace for encode/decode filters makes sense.<br />
      -> I looked at your code in the incubator (hope it's the right place), but there are some parts I don't understand. Do you have a proposal for it that we could discuss?</p>

  5. Aug 03, 2010

    <ac:macro ac:name="note"><ac:rich-text-body><p><strong>Community Review Team Recommendation</strong></p>

    <p>The ZF CR-Team has postponed a decision on this proposal and requests that the proposer amend their proposal to answer the following questions:</p>

    <ul>
    <li>What is the purpose and objective of these components?</li>
    <li>Why are these components needed for the Zend Framework?</li>
    <li>Why is the implementation performed character by character?</li>
    <li>What are the security implications (if any) of the current implementation?</li>
    </ul>

    <p>If the proposer wishes, they may contact the CR-Team on #zftalk.dev to discuss these questions.</p></ac:rich-text-body></ac:macro>

    1. Aug 05, 2010

      <blockquote>
      <ul>
      <li>What is the purpose and objective of these components?</li>
      <li>Why are these components needed for the Zend Framework?</li>
      </ul>
      </blockquote>
      <ul>
      <li>HTML is often used to format texts.<br />
      Some web services link to HTML formatted text files (without a head). Entities are often used there to encode specific characters not supported by the character set in use.<br />
      -> If you use such a file you have to convert it to the character set you need; to minify it you can convert entities if they are supported by your charset, or if you don't allow entities you can remove all unsupported ones.</li>
      </ul>

      <ul>
      <li>usage of characters not supported by the character set in use:<br />
      Some companies don't use UTF-8 (for historical reasons) and often use entities to write unsupported characters. (related to the first note)</li>
      </ul>

      <ul>
      <li>editable divs<br />
      Rich text editors use an editable div element which sends its content as HTML formatted text, too. If you need to convert it you have to handle all contained entities.</li>
      </ul>

      <ul>
      <li>htmlentities and friends only work with a fixed entity reference/table:<br />
      Sometimes you need to encode/decode non-standard entities, which isn't possible with the built-in PHP functions without using DOM (see the snippet after this list).<br />
      Additionally, you can't configure htmlentities and friends to choose between hex, decimal or named entities.</li>
      </ul>
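
      <p>A small demonstration of that limitation using only the built-in function (nothing here is part of the proposed component):</p>
      <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
      // htmlentities() only knows its fixed table and offers no choice of entity style:
      echo htmlentities('€ & ☺', ENT_QUOTES, 'UTF-8');
      // -> "&euro; &amp; ☺"
      //    there is no option to emit "&#8364;" / "&#x20AC;" instead, and ☺ is left
      //    untouched because it has no named entity in the table.
      ]]></ac:plain-text-body></ac:macro>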

      <blockquote>
      <ul>
      <li>Why is the implementation performed character by character?</li>
      </ul>
      </blockquote>
      <p>On decoding, only entities are processed (entity to character).<br />
      On encoding, only UTF-8 characters above code point 127 are processed (character -> entity).<br />
      Named entities described in the reference table are converted in one pass (strtr).<br />
      I have to do it character by character because I have to check whether each character is valid in the given character set or not (see the sketch below).<br />
      E.g. when encoding "öäü€" to ISO-8859-1, only the "€" must be converted.</p>
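
      <p>A minimal sketch of that per-character check (not the proposal's actual implementation; iconv error handling and the entity style are simplified):</p>
      <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
      // Keep a character if it survives conversion to the output charset,
      // otherwise replace it with a decimal numeric entity.
      function encodeUnsupportedChars($utf8Text, $outCharset = 'ISO-8859-1')
      {
          $result = '';
          // split the UTF-8 string into individual characters
          foreach (preg_split('//u', $utf8Text, -1, PREG_SPLIT_NO_EMPTY) as $char) {
              $converted = @iconv('UTF-8', $outCharset . '//IGNORE', $char);
              if ($converted === false || $converted === '') {
                  // not representable -> numeric entity built from the code point
                  $ucs4    = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));
                  $result .= '&#' . $ucs4[1] . ';';
              } else {
                  $result .= $converted;
              }
          }
          return $result;
      }

      // encodeUnsupportedChars('öäü€') keeps ö, ä, ü and turns € into &#8364;
      ]]></ac:plain-text-body></ac:macro>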

      <blockquote>
      <ul>
      <li>What are the security implications (if any) of the current implementation?</li>
      </ul>
      </blockquote>
      <p>This component only decodes entities to characters by converting the Unicode code point of the entity to UTF-8 and then to the configured character set. If this fails, one of the configurable actions is performed (see the sketch below).<br />
      It encodes characters to entities,<br />
      or rather it converts characters to UTF-8 and then, if they are not supported by the given charset, encodes the Unicode code point as a hex or decimal entity.<br />
      -> There is no handling of HTML-specific tags where security would be relevant, such as the script tag.</p>
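
      <p>For illustration, a minimal sketch of that decode path for a single numeric entity (not the proposal's implementation; the action handling is reduced to translit plus a substitute character):</p>
      <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
      // Decode one numeric entity: code point -> UTF-8 -> output charset.
      // If the output charset cannot hold the character, fall back to a
      // substitute (the proposal also allows exception, ignore, entity).
      function decodeNumericEntity($codepoint, $outCharset = 'ISO-8859-1', $substitute = '?')
      {
          // code point -> UTF-8
          $utf8 = iconv('UCS-4BE', 'UTF-8', pack('N', $codepoint));

          // UTF-8 -> output charset, transliterating where possible
          $converted = @iconv('UTF-8', $outCharset . '//TRANSLIT', $utf8);
          if ($converted === false || $converted === '') {
              return $substitute;
          }
          return $converted;
      }

      echo decodeNumericEntity(228);  // "ä" in ISO-8859-1
      echo decodeNumericEntity(8364); // "EUR" via TRANSLIT (or the substitute, depending on iconv)
      ]]></ac:plain-text-body></ac:macro>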

      1. Aug 12, 2010

        <p>Marc – To be honest, this seems like a highly specific and specialized component, and I'm wondering if there's broad enough appeal to warrant its inclusion. Can you provide some more detailed use cases, and indicate the spectrum of use cases for which it would enable? Also, it seems to me like a component such as HTMLPurifier or Padraic's Wibble might offer a more secure approach here – could you comment on that, please?</p>

        1. Aug 12, 2010

          <p>Hi Marc,</p>

          <p>The first part of your response touches quite firmly on why the proposal was questioned. HTML from any external source (even if it appears trusted) must be considered insecure until it has been validated (or verified as being received from a trusted source - public web services are generally not trusted due to their potential for spawning a mass XSS attack). Since there is no actual validation for HTML possible (validation alone is completely useless), there exist HTML sanitisers like HTMLPurifier. Simply tidying up entities is not sufficient to prevent the potential for XSS and Phishing vulnerabilities. So this particular use case is not entirely suitable for your proposal, at least not without specifying the precise restrictions under which it applies.</p>

          <p>The second part addresses the valid use of entities to represent Unicode multibyte characters. These entities remain valid even for Unicode encoded HTML (so it is not a validation issue but a tidy output issue). Once you see it's related to output, again we fall into the scope of a HTML Sanitiser. The same case with the third (user editing via a RTE).</p>

          <p>On the fourth case, we're running into a few side issues. The native entity functions are not always ideal, but neither are they so completely useless as to necessitate a move to iconv in its entirety. This is where your proposal is missing details explaining why we need to be worried about different entity types which are all, presumably, valid in the context of HTML. For example, what is a "non-standard entity"? It's not defined anywhere. If these are illegal entities for standards compliant HTML, what is the scope of the problem they cause? We need to understand what the problem is before we can judge whether the solutions fits.</p>

          <p>To put all of this into perspective, and being fully aware that it is not accepted in the ZF at this time, Wibble (Zend/Html/Filter) should perform all of what you appear to be proposing with the added benefit that it would be a full featured HTML sanitiser.</p>

          <p>There was a reason for the security questions beyond the obvious. HTML vulnerabilities exist surrounding entity interpretation. For example, entities may be interpreted as their literal representations in specific contexts (notably attribute values, and CSS values). An attacker is more than capable of applying entities one or more times under the assumption that a single decoding run will leave doubly encoded entities (e.g. &amp;gt;) behind as a single entity (i.e. a decoding bypass). Since you don't treat the HTML via DOM or other HTML aware means, it seems from the proposal that this filter may not be capable of accounting for multiple levels of entities.</p>

          <p>While I can see the value of what the proposal is out to achieve, so far it seems to suffer from its narrow scope. In that respect, it's a little like Zend_Filter_StripTags. Does it strip tags? Yes. Does it strip tags under all circumstances? No. Does it actually add anything to security? Not really. It works, but in a limited fashion that needs to carefully considered. This filter has the same issue. It will perform exactly what it says it will and no more. It's one slice out of the bigger puzzle of filtering HTML. I'm not saying that makes it useless, just that it's only useful for very specific tightly controlled inputs. If it were to be added to the framework, I would insist that its documentation contain a section just to make the points above as clear as possible (assuming these are not based on a misunderstanding of the component's operation).</p>

          <p>Outside of security, the proposal still struggles to explain its purpose. Both the overview and operation sections are extremely brief and don't really make a case for including the component in the ZF. These sections need to try and make it clear to readers why we need/want these new filters. We get they encode/decode entities, but what makes that useful compared to every other option I presume people are already using (I just stick stuff like this through HTMLPurifier - but I'm sure people use other strategies).</p>

  6. Aug 16, 2010

    <p>Dude came in the support channel and asked me to translate this section for him. Sounds like plain english to me, else "/trout hadean" in #zftalk.dev @ freenode...</p>

    <p>START NUB ENGLISH :::</p>

    <p>This entity filter aims to convert sections of text containing entities, no matter whether they come as HTML or XML (Wibble/Zend_Html_Filter, in contrast, require the HTML format). A section can involve multiple character sets, for example Unicode for the entities and an ordinary one for the text. The filter helps convert these sections into a different character set, with or without the entities.</p>

    <p>Why?<br />
    1. Not every program or service works with clean Unicode; many have grown historically. When you are forced to work with these programs you have to convert their output (I had to work with such services on many projects).</p>

    <p>2. Many users continue to use HTML entities, even though the site already runs with native Unicode. With a little help from this filter, you can remove the entity overhead by converting the non-relevant ones (those not needed for escaping) into UTF-8.</p>

    <p>3. When you want to convert HTML into plain text, stripping out the tags (strip_tags()) won't do it. You also have to remove the entities, which can only partially be achieved with htmlentities() and the like, as they use a predefined entity table and limited character sets, and offer no functions to handle incompatible characters (like the euro sign in an ISO-8859-1 section).</p>

    <p>Security:<br />
    As the filter does not work on the HTML/XML layer, such security measures are unnecessary here. It is a fact that Unicode has to follow several rules, but through using iconv (or entities) malformed Unicode characters should be recognized and trigger the "invalidCharAction" (this should be added to the unit tests).</p>

    <p>Definitions:</p>
    <ul class="alternate">
    <li>invalid character: a character which cannot be converted or is not part of the output character set.</li>
    <li>invalid entity: an entity which is undefined.</li>
    </ul>

    <p>::: STOP NUB ENGLISH</p>

    1. Aug 17, 2010

      <p>Entities operate in XML/HTML - i.e. definitely in the HTML layer. Some Unicode valid characters are a security risk (e.g. half-width characters among others <a href="http://heideri.ch/jso/#80">IE6 and halfwidth/fullwidth Unicode characters</a>). No mention whether the component will handle nested/double encodings in any way (a known reconstructive risk to bypass HTML/XML filters for decoding).</p>

      1. Aug 17, 2010

        <ul>
        <li>Entities operate in XML/HTML:<br />
        Yes, but they are often used outside of it</li>
        </ul>

        <ul>
        <li>Some Unicode valid characters are a security risk:<br />
        This is not a security issue of entities. You can write such characters as entities AND as native characters in a Unicode charset.<br />
        -> On your example page there isn't a single entity!<br />
        -> If you set a Unicode charset as output you have to handle/filter these characters whether they were entities or not.<br />
        -> If you don't set a Unicode charset as output this filter calls the invalidCharAction because these characters can't be converted to your output charset.</li>
        </ul>

        <ul>
        <li>nested/double encodings:<br />
        I'm not sure if I understand you right.<br />
        Double-encoded entities can't automatically be double-decoded. You can't know whether the text is valid, native text after only one decode.<br />
        E.g. if you paste an example HTML snippet into a comment to describe something, your HTML snippet must be double-encoded when your comment is displayed. Anyone reading this displayed HTML should only decode it once to get the native text.<br />
        What's wrong with that?</li>
        </ul>

        1. Aug 18, 2010

          <p>Marc, we're obviously not on the same page so let me try and put it into perspective.</p>

          <p>Consider a point to point use case from receiving input (containing entities) to echoing output to a browser. I'm guessing your first instinct is to pass everything (after entity filtering) through htmlspecialchars(). This is perfectly fine. Now, consider use case #1 from the proposal and you find a problem - you can't run htmlspecialchars() across HTML intended for direct output (i.e. from a rich text editor using HTML tags and styling). That would negate the HTML and remove the formatting/styling intended by the rich text editor. Using your filter adds nothing in this regard. You STILL can't run it through htmlspecialchars() - it's HTML, not plain text. So what's left to ensure secure output? Use a HTML Sanitiser (a secure one). This will not have any issues with your filter, but it also makes your filter redundant - a good sanitiser will decompose entities into their literal characters correctly. This basically makes use case #1 redundant from the perspective that a HTML sanitiser is still required and should not need any additional entity decoding beyond that.</p>

          <p>Use case #2 has similar issues. It's clearly both in the HTML domain and utilises HTML sanitisation via strip_tags() which is not considered a sanitisation option since it's not character encoding aware (also it strips tags but doesn't clean or validate anything that is added as a safe element, as would be necessary in different cases). strip_tags() is basically evil - even the manual doesn't note all of its limitations which is a shame.</p>

          <p>This is the point I'm making by emphasising security as a concern. You keep insisting that security is not an issue and that the component does not operate in the HTML domain, and thus fail to recognise that many uses for the component directly depend on bypassing output filtering in favour of HTML sanitisation. The component doesn't exist in isolation. This means that, whether you see it that way or not, your component is effectively aimed at replacing one element of HTML sanitisation. The only other alternative is that it is intended for use without a HTML sanitiser or with a far more limited sanitiser, neither of which is recommended security practice in PHP. In this respect the Decoder (I see nothing insecure about an Encoder) is redundant unless within a very narrow scope where all possible HTML elements are neutralised prior to output. This is not addressed or mentioned anywhere, and the context of the use cases suggests nothing to limit its scope.</p>

          <p>More relevantly (just to add to the confusion of using this alongside a sanitiser), there is the risk that someone would use the Decoder after HTML Sanitisation. Since the decoder filter is capable of reversing previously added entities (e.g. the &amp; a sanitiser adds to replace the & in any entities original to the input - usually done with attributes as a precaution against XSS/Phishing), it is possible to undo some sanitisation modifications. This might be wrong - there's no decoder implementation to refer to yet so I'm assuming it's possible going by the proposal text.</p>

          1. Sep 01, 2010

            <p>hi paddy</p>

            <p>This filter doesn't run the text through htmlspecialchars() - it uses its own functions for encoding and decoding, and this process is configurable to keep the existing special chars as-is -> then <>'"& are not encoded, but all others are.</p>

            <p>The main job of this filter is to handle entities (and only entities) in a more adjustable way than the built-in htmlentities() and friends do. I mean it should help convert such documents from one charset to another while minimizing the entity overhead.</p>

            <p>You're speaking about security, but I don't claim this makes anything more secure than htmlentities(). It only adds some functionality. To simply encode HTML for browser output, the built-in functions do the same with better performance, and an HTML sanitiser would still be needed.</p>

            <blockquote><p>This basically makes use case #1 redundant from the perspective that a HTML sanitiser is still required and should not need any additional entity decoding beyond that.</p></blockquote>
            <p>Do you mean that this should be part of an HTML Sanitiser? Then it should be usable outside of DOM.</p>

            1. Sep 02, 2010

              <blockquote>
              <p>This filter doesn't run the text through htmlspecialchars() - it uses its own functions for encoding and decoding, and this process is configurable to keep the existing special chars as-is -> then <>'"& are not encoded, but all others are.</p></blockquote>
              <p>Is the default to decode entities such as &gt; or to keep them undecoded?</p>
              <blockquote>
              <p>The main job of this filter is to handle entities (and only entities) in a more adjustable way than the built-in htmlentities() and friends do. I mean it should help convert such documents from one charset to another while minimizing the entity overhead.</p></blockquote>
              <p>I'm not so worried about encoding/decoding as I am about the context in which they are used. Many of the use cases show it being used prior to output (thus its operation in that context should encourage secure behaviour). It has to be assumed that a secure HTML sanitiser is being used - you acknowledge this by your use of strip_tags(), for example. </p>
              <blockquote>
              <p>You're speaking about security, but I don't claim this makes anything more secure than htmlentities(). It only adds some functionality. To simply encode HTML for browser output, the built-in functions do the same with better performance, and an HTML sanitiser would still be needed.</p></blockquote>
              <p>That's what I've been saying. </p>
              <blockquote>
              <p>Do you mean that this should be part of an HTML Sanitiser? Then it should be usable outside of DOM.</p></blockquote>
              <p>It already should be part of a HTML Sanitiser. I'm willing to agree that most HTML Sanitisers are insecure anyway, but any good one will handle entities correctly. Try adding any text to have its entities decoded between character sets into HTMLPurifier's Demo for example. It already handles a good part of what you're proposing.</p>

              1. Sep 04, 2010

                <blockquote><p>Is the default to decode entities such as &gt; or to keep them undecoded?</p></blockquote>
                <p>By default all characters will be encoded/decoded.</p>

                <blockquote><p>It already should be part of a HTML Sanitiser. I'm willing to agree that most HTML Sanitisers are insecure anyway, but any good one will handle entities correctly. Try adding any text to have its entities decoded between character sets into HTMLPurifier's Demo for example. It already handles a good part of what you're proposing.</p></blockquote>
                <p>I took a look at Wibble and HTMLPurifier, but I can't find enough information on how to:</p>
                <ul class="alternate">
                <li>remove entities used for characters outside the document charset (translit, substitute, ignore, ...)</li>
                <li>convert a document from one charset to another</li>
                <li>handle unknown (self-defined) entities</li>
                <li>encode to named entities vs. numeric vs. hex</li>
                </ul>