Issues

ZF-3013: add Filter HtmlEntityDecode & HtmlEntityEncode (rename HtmlEntities -> HtmlEntityEncode)

Description

  • add the filter Zend_Filter_HtmlEntityDecode to decode html entities to plain text
  • rename the filter Zend_Filter_HtmlEntities to Zend_Filter_HtmlEntityEncode for named consistently

add features not supported by php's functions: - decode and encode hex entities - decode and encode decimal entities - selectable if named entities will be used (on encode) - selectable if hex or decimal entities will be used (on encode) - selectable to throw an exception if an entity can't convert to the given encoding or ignore it - use mbstring (if installed) to encode/decode encodings not supported by htmlentities and friends

Comments

Please evaluate and categorize/assign as necessary.

It seems like the proposed change would have to wait for 2.0 for BC reasons. Matthew, please evaluate whether it is something we should do and, if so, when we should do it.

Waiting for reply from dev-team (ralph) since 26.06.09

This is a little script to convert a html text to UTF-8: (It is not optimal because it doesn't handle script tags and some other specials)


// $text = html text
// $enc = encoding of html text
$text = mb_convert_encoding($text, 'UTF-8', $enc);
// convert numeric entities
$pattern = '/\d{2,5};/u';
$text = preg_replace_callback($pattern, '_decEntitiesToUtf8', $text);
// convert hex entities
$pattern = '/([a-f0-9]{2,8});/ui'; // erst in XML muss das "x" klein geschrieben werden
$text = preg_replace_callback($pattern, '_hexEntitiesToUtf8', $text);
// convert named entities (but not ",',<,>,&)
$htmlTransTable = get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES);
$xmlTransTable  = get_html_translation_table(HTML_SPECIALCHARS, ENT_NOQUOTES);
$myTransTable   = array();
foreach (array_diff_assoc($htmlTransTable, $xmlTransTable) as $char => $ent) {
    $myTransTable[$ent] = mb_convert_encoding($char,'UTF-8','ISO-8859-1');
}
$text = strtr($text, $myTransTable);

function _decEntitiesToUtf8($matches) {
    $convmap = array(0x0, 0x10000, 0, 0xfffff);
    $transTable = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES);
    $ret = mb_decode_numericentity($matches[0], $convmap, 'UTF-8');
    if (isset($transTable[$ret])) {
        return $transTable[$ret];
    } else {
        return $ret;
    }
}

function _hexEntitiesToUtf8($matches) {
    $convmap = array(0x0, 0x10000, 0, 0xfffff);
    $transTable = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES);
    $ret = mb_decode_numericentity(hexdec($matches[1]), $convmap, 'UTF-8');
    if (isset($transTable[$ret])) {
        return $transTable[$ret];
    } else {
        return $ret;
    }
}

I got reply some days ago from the devteam...

No for the renaming from HtmlEntities to HtmlEntityEncode Yes for the new HtmlEntityDecode. No for mbstring because iconv is installed per default but mbstring not.

I created a proposal for this: http://framework.zend.com/wiki/pages/…

-> the functionality is different to htmlentities, htmlspecialchars and friends and should not replace these filter(s).

A basic Zend_Filter_HtmlEntityDecode:


<?php

class Zend_Filter_HtmlEntityDecode implements Zend_Filter_Interface
{

    /**
     * Corresponds to the second htmlentities() argument
     *
     * @var integer
     */
    protected $_quoteStyle;

    /**
     * Corresponds to the third htmlentities() argument
     *
     * @var string
     */
    protected $_encoding;

    /**
     * Sets filter options
     *
     * @param  integer|array $quoteStyle
     * @param  string  $charSet
     * @return void
     */
    public function __construct($options = array())
    {
        if (!is_array($options)) {
            $options = func_get_args();
            $temp['quotestyle'] = array_shift($options);
            if (!empty($options)) {
                $temp['charset'] = array_shift($options);
            }

            $options = $temp;
        }

        if (!isset($options['quotestyle'])) {
            $options['quotestyle'] = ENT_COMPAT;
        }

        if (!isset($options['encoding'])) {
            $options['encoding'] = 'UTF-8';
        }
        if (isset($options['charset'])) {
            $options['encoding'] = $options['charset'];
        }

        $this->setQuoteStyle($options['quotestyle']);
        $this->setEncoding($options['encoding']);
    }

    /**
     * Returns the quoteStyle option
     *
     * @return integer
     */
    public function getQuoteStyle()
    {
        return $this->_quoteStyle;
    }

    /**
     * Sets the quoteStyle option
     *
     * @param  integer $quoteStyle
     * @return Zend_Filter_HtmlEntities Provides a fluent interface
     */
    public function setQuoteStyle($quoteStyle)
    {
        $this->_quoteStyle = $quoteStyle;
        return $this;
    }

    /**
     * Get encoding
     *
     * @return string
     */
    public function getEncoding()
    {
         return $this->_encoding;
    }

    /**
     * Set encoding
     *
     * @param  string $value
     * @return Zend_Filter_HtmlEntities
     */
    public function setEncoding($value)
    {
        $this->_encoding = (string) $value;
        return $this;
    }

    /**
     * Returns the charSet option
     *
     * Proxies to {@link getEncoding()}
     *
     * @return string
     */
    public function getCharSet()
    {
        return $this->getEncoding();
    }

    /**
     * Sets the charSet option
     *
     * Proxies to {@link setEncoding()}
     *
     * @param  string $charSet
     * @return Zend_Filter_HtmlEntities Provides a fluent interface
     */
    public function setCharSet($charSet)
    {
        return $this->setEncoding($charSet);
    }

    /**
     * Defined by Zend_Filter_Interface
     *
     * Returns the string $value, converting HTML entity to their corresponding characters
     * equivalents where they exist
     *
     * @param  string $value
     * @return string
     */
    public function filter($value)
    {
        return html_entity_decode((string) $value, $this->getQuoteStyle(), $this->getEncoding());
    }

}

not accepted by CR-Team