ZF-3439: TMX Adapter shifts some source ISO-8859-1 diacritic characters into double-byte UTF-8 output
This specific bug relates to the Zend_Translate TMX Adapter. It may also apply to other XML translation adapters.
It appears from a cursory test that the Adapter's use of xml_parser_create() creates a situation where in the incoming character encoding of the source input, is transferred to UTF-8 when output. This behaviour is enabled by default and documented in the PHP manual.
As a result ISO-8859-1 diacritics in the input source are transformed into their equivelant double-byte UTF-8 counterparts (i.e. the base character + diacritic). This breaks translations using ISO-8859-1 diacritics in the TMX XML format, when the HTML output is sent to the client browser using an ISO-8859-1 Content-Type.
A simple test is sufficient to prove the encoding shift.
Assuming an ISO-8859-1 encoding TMX file (see http://svn.astrumfutura.org/zfblog/branches/… ), the following does a simple strlen() byte count on the output from a single translated umlaut.
require_once 'Zend/Translate.php'; $translate = new Zend_Translate('tmx', 'translate_de.xml', 'de'); // ISO-8859-1 encoded source $translatedUmlaut = $translate->_('umlaut'); echo strlen($translatedUmlaut); // 2, expected 1 byte only
Notably, no browser here so that part of encoding is irrelevant. Instead a simple internal check shows the output umlaut is a double-byte character when output. Throwing this into a browser with a Content-Type set to the ISO-8859-1 charset results in "Ã ¼" being output.
The rawest solution is to locate the xml_parser_call() used in the TMX adapter and pass it the optional output encoding parameter set to "ISO-8859-1" which ensures encoding is untouched, and corrects the above problem. I have not done a similar check against other XML based adapters, but presumably whereever xml_parser_create() turns up this possible problem will follow it.