ZF-11001: Better UTF-16 support for PDF Metadata stored in Document Information Dictionary (DID)
Description
Currently in the DID, Zend_Pdf treats certain metadata properties specially. Namely: Title, Author, Subject, Keywords, Creator, Producer, ModDate, CreationDate. The non-date fields will be re-encoded as UTF-16 if they should be. However, the PDF standard specifies all metadata should be in either ASCII or UTF-16.
If you are managing metadata with Zend_Pdf, it is confusing (and should not be necessary) to have to treat your custom metadata fields separately from the Zend-handled fields listed above. Further, custom data should adhere to the PDF standard (whether or not the user knows it). For consistency and standards, Zend_Pdf should encode all data that needs to be as UTF-16.
Attaching patches that accomplish this. Feedback welcome.
Comments
Posted by Chris Hiestand (dimmer) on 2011-01-27T00:37:14.000+0000
How does one attach files? I can't find the option. So I'm attaching patches inline to this comment.
--- Pdf.orig.php 2011-01-26 23:18:21.000000000 -0800 +++ Pdf.php 2011-01-26 23:59:57.000000000 -0800 @@ -1226,12 +1226,6 @@ case 'Creator': // break intentionally omitted case 'Producer': - if (extension_loaded('mbstring') === true) { - $detected = mb_detect_encoding($value); - if ($detected !== 'ASCII') { - $value = "\xfe\xff" . mb_convert_encoding($value, 'UTF-16', $detected); - } - } $docInfo->$key = new Zend_Pdf_Element_String((string)$value); break;
--- String.orig.php 2011-01-26 23:18:39.000000000 -0800 +++ String.php 2011-01-27 00:05:27.000000000 -0800 @@ -70,7 +70,15 @@ */ public function toString($factory = null) { - return '(' . self::escape((string)$this->value) . ')'; + $value = $this->value; + if (extension_loaded('mbstring') === true) { + $detected = mb_detect_encoding($value); + if (!empty($detected) and $detected !== 'ASCII') { + $value = "\xfe\xff" . mb_convert_encoding($value, 'UTF-16', $detected); + } + } +
+ return '(' . self::escape((string)$value) . ')'; }
Posted by Chris Hiestand (dimmer) on 2011-01-27T00:49:38.000+0000
I couldn't escape all the wiki formatting characters, but you get the point here...
--- Pdf.orig.php 2011-01-26 23:18:21.000000000 -0800 +++ Pdf.php 2011-01-26 23:59:57.000000000 -0800 @@ -1226,12 +1226,6 @@ case 'Creator': // break intentionally omitted case 'Producer': - if (extension_loaded('mbstring') === true) { - $detected = mb_detect_encoding($value); - if ($detected !== 'ASCII') { - $value = "\xfe\xff" . mb_convert_encoding($value, 'UTF-16', $detected); - } - } $docInfo->$key = new Zend_Pdf_Element_String((string)$value); break;
--- String.orig.php 2011-01-26 23:18:39.000000000 -0800 +++ String.php 2011-01-27 00:05:27.000000000 -0800 @@ -70,7 +70,15 @@ */ public function toString($factory = null) { - return '(' . self::escape((string)$this->value) . ')'; + $value = $this->value; + if (extension_loaded('mbstring') === true) { + $detected = mb_detect_encoding($value); + if (!empty($detected) and $detected !== 'ASCII') { + $value = "\xfe\xff" . mb_convert_encoding($value, 'UTF-16', $detected); + } + } +
+ return '(' . self::escape((string)$value) . ')'; }