ZF-9141: ZFReader getDescription() decodes html entities with wrong character set

Description

{{$description = html_entity_decode($description, ENT_QUOTES, $this->getEncoding());}}

DOMDocument internally uses UTF-8 so typically content coming from this component is encoded in Unicode regardless of how the xml file itself is encoded. Latin1 £ comes back as unicode £ sign, but & pound; is passed through html_entity_decode using document's charset resulting in latin1 £ sign. The output becomes an invalid Unicode string. This should be changed so UTF-8 is always used for output:

{{$description = html_entity_decode($description, ENT_QUOTES, 'UTF-8');}}

Comments

Hardcoding the encoding without first converting the string to that encoding is typically a bad idea; while this can be done using iconv or mbstring, we try hard to only do that when absolutely necessary. Additionally, DOMDocument will use the encoding you specify so long as the encoding is set in the object prior to populating it; that same encoding will then be used when you cast the object to XML.

Paddy -- I'll let you triage from here.

I'll be looking into this later today, but Matthew is right in that there is no way I can alter the encoding setting safely. It's based entirely on the input feed and defaults to UTF-8 only if the encoding cannot be detected at all.

A few of us ran over the current API for Zend_Feed_* prior to 1.10, and I have a long list of improvements (from the very large to the very small) being worked through. This particular issue IS actually a bug since technically, using html_entity_decode() at this point is in error. The problem, as you're seeing, is that we are attempting to do some translation between XML and HTML. It took us a while to realise that, while it makes conceptual sense to pull out literal characters, doing so ignores the context - the XML is a vehicle for HTML (not plain text in most cases) and entities are valid HTML, i.e. the translation is unnecessary.

What this means in terms of results, is that instead of getting a £ character you should get back ``` instead. Since the context is HTML, this would display correctly under any encoding. You will lose some context in that it's a currency sign, but presumably the encoding was sourced from the original HTML article so it was an entity by design (not automatically encoded from a literal £ character). This does increase the risk of a misbehaving source (i.e. it's actually plain text and the entity encoding was poorly applied) but we can't solve every single problem with feeds (and God knows there's a lot of those already from bad encoding to formatting dates invalidly).

Hopefully that makes sense ;).

I think it makes sense, but it should be possible to fix current implementation too, so at least it produces valid output for now. If I got this right, the html_entity_decode() function needs to use output encoding not input encoding for this to work. Input encoding is irrelevant in this case as you're decoding entities compose of ASCII characters. The result of decoding needs to match other characters returned from getDescription which seem to always be encoded in Unicode. My impression was that DOMDocument will always use UTF-8 hence ZFReader methods will always return UTF-8 too, if it's not the case you could just detect current output encoding instead of hardcoding it as I originally suggested.

Bringing this back into perspective (after my short break from ZF), I've decided that the correct move is to simply remove the decoding. Applying html_entity_decode at this point makes a certain amount of sense, but given the problem with detecting/using the correct encoding, and the fact we are altering the original content unnecessarily, it's just not adding anything of much value.

Since any backwards compatibility issue can be qualified in terms that we are restoring the original content within a HTML context, I see no reason why this can't be done immediately for the next mini-release.

Any comments from the reporter or those watching this issue?

I agree this solution makes sense.

Fixed in r22300 - thanks for bringing this issue up, and your patience ;)