ZF-19: Binary file parser class for Zend_Pdf
Description
PDF documents may contain large binary files such as TIFF, JPEG, or PNG images and TrueType fonts. Including these files inside the PDF document requires that they first be parsed and certain data extracted from them so that an appropriate information dictionary can be constructed.
Currently, the object constructor methods for each of these binary types are written to read and deal directly with filesystem objects using the traditional fopen(), fread(), unpack() and similar functions.
This prevents them from using data from other sources, such as in-memory images generated by GD, without first writing the data to a temporary file on disk. Additionally, many of the primitive parser functions, such as extracting a four-byte unsigned integer, must be re-implemented in each class. Finally, the robust error-handling code needed for interacting with the filesystem clutters the actual parser code, making it harder to follow.
Build an abstract file parser class for use by these objects with the following functionality:
Comments
Posted by Willie Alberty (willie) on 2006-06-17T13:59:58.000+0000
Done. Please refer to the inline API documentation for a complete reference. Here's a quick introduction:
For absolute seeking, you use {{moveToOffset()}}. Relative seeking is usually handled automatically by the read functions. That is to say, {{readBytes()}}, {{readUInt()}}, {{readInt()}}, etc. all shift the current offset forward by the number of bytes read. You can also move forward or backward relative to the current position at any time using {{skipBytes()}}.
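To make the offset behaviour concrete, here is a minimal sketch over an in-memory string. The class {{SketchCursor}} and its {{getOffset()}} helper are hypothetical stand-ins written for this illustration; only the method names {{moveToOffset()}}, {{skipBytes()}}, {{readBytes()}} and {{readUInt()}} come from the description above, and the signatures and big-endian byte order shown here are assumptions, not the actual Zend_Pdf_FileParser code.

```php
<?php
// Hypothetical stand-in for illustration only; not the Zend_Pdf_FileParser source.
class SketchCursor
{
    private $data;
    private $offset = 0;

    public function __construct($data)
    {
        $this->data = $data;
    }

    public function getOffset()
    {
        return $this->offset;
    }

    // Absolute seek.
    public function moveToOffset($offset)
    {
        if ($offset < 0 || $offset > strlen($this->data)) {
            throw new OutOfRangeException("Offset $offset is outside the data");
        }
        $this->offset = $offset;
    }

    // Relative seek; a negative count moves backward.
    public function skipBytes($count)
    {
        $this->moveToOffset($this->offset + $count);
    }

    // Reads $count bytes and advances the offset by the same amount.
    public function readBytes($count)
    {
        if ($count < 0 || $this->offset + $count > strlen($this->data)) {
            throw new OutOfRangeException('Read past end of data');
        }
        $bytes = substr($this->data, $this->offset, $count);
        $this->offset += $count;
        return $bytes;
    }

    // Big-endian unsigned integer built from $size bytes (assumed byte order).
    public function readUInt($size)
    {
        $value = 0;
        foreach (str_split($this->readBytes($size)) as $byte) {
            $value = ($value << 8) | ord($byte);
        }
        return $value;
    }
}

$cursor  = new SketchCursor("\x00\x01\x00\x00\x00\x0Cglyf");
$version = $cursor->readUInt(4);   // offset moves from 0 to 4
$count   = $cursor->readUInt(2);   // offset moves from 4 to 6
$tag     = $cursor->readBytes(4);  // offset moves from 6 to 10
$cursor->skipBytes(-4);            // relative seek back to 6
$again   = $cursor->readBytes(4);  // re-reads the same four bytes
$cursor->moveToOffset(0);          // absolute seek to the start
```

Note that the caller never adjusts the offset between consecutive reads; an explicit seek is only needed when jumping around or re-reading.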
Here's how you use the parser:
First, you'll need to create a concrete subclass of Zend_Pdf_FileParser. This subclass needs to implement two methods:
{{screen()}} - Performs a cursory check to verify that the file is in the expected format. Intended to quickly weed out obviously bogus files. For example, check the file signature bytes at the top of the file.
{{parse()}} - Does the actual parsing. The parse implementation uses the functions provided by the base class to move around the file and extract data from it. Here is an example:
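As a minimal sketch of such a subclass, the code below reads the width and height from a PNG header. To keep it runnable on its own it uses a trimmed-down, hypothetical base class ({{SketchFileParser}}) in place of Zend_Pdf_FileParser; the class names, constructor signature, and exception types are all assumptions made for illustration, and the real base class provides more primitives and error handling.

```php
<?php
// Hypothetical, trimmed-down base class so the sketch runs on its own.
// It is NOT the Zend_Pdf_FileParser source.
abstract class SketchFileParser
{
    protected $data;
    protected $offset = 0;

    public function __construct($data)
    {
        $this->data = $data;
    }

    abstract public function screen();
    abstract public function parse();

    public function moveToOffset($offset)
    {
        $this->offset = $offset;
    }

    public function readBytes($count)
    {
        if ($this->offset + $count > strlen($this->data)) {
            throw new RuntimeException('Unexpected end of data');
        }
        $bytes = substr($this->data, $this->offset, $count);
        $this->offset += $count;
        return $bytes;
    }

    // Big-endian unsigned integer (assumed byte order).
    public function readUInt($size)
    {
        $value = 0;
        foreach (str_split($this->readBytes($size)) as $byte) {
            $value = ($value << 8) | ord($byte);
        }
        return $value;
    }
}

// Concrete parser for the fixed-position part of a PNG file.
class SketchPngHeaderParser extends SketchFileParser
{
    private $width;
    private $height;

    // Cheap check: the first eight bytes must be the PNG signature.
    public function screen()
    {
        $this->moveToOffset(0);
        if ($this->readBytes(8) !== "\x89PNG\r\n\x1a\n") {
            throw new RuntimeException('Not a PNG file');
        }
    }

    // The IHDR chunk follows the signature: a 4-byte length, the 4-byte
    // type 'IHDR', then width and height as 4-byte big-endian integers.
    public function parse()
    {
        $this->moveToOffset(12);
        if ($this->readBytes(4) !== 'IHDR') {
            throw new RuntimeException('IHDR chunk not found');
        }
        $this->width  = $this->readUInt(4);
        $this->height = $this->readUInt(4);
    }

    public function getWidth()
    {
        return $this->width;
    }

    public function getHeight()
    {
        return $this->height;
    }
}

// Hand-built header bytes: signature, IHDR length and type, 640 x 480.
$png = "\x89PNG\r\n\x1a\n" . "\x00\x00\x00\x0dIHDR"
     . "\x00\x00\x02\x80" . "\x00\x00\x01\xe0";

$parser = new SketchPngHeaderParser($png);
$parser->screen();
$parser->parse();
```

The split keeps {{screen()}} cheap enough to run against any candidate file, while {{parse()}} can assume the signature has already been verified.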
The {{parse()}} implementation should usually store the extracted information in instance variables. You will then provide appropriate accessor methods to obtain the values. You could alternatively have an empty {{parse()}} method and obtain everything lazily from accessor methods. The primitive functions provided by the base class and the data source being parsed are available for the entire life span of the object.
The approach you take will largely depend on the intent of the parser you are writing. In the case of fonts, we need virtually all of the information in the file, so it's all extracted at once into a big associative array and made available via {{__get()}}.
If you'd rather provide a utility class that is able to extract different bits and pieces from the file, but never everything at once, you would create a bunch of accessor methods that act as mini-parsers. This way you don't incur the overhead of parsing the entire file unless it is absolutely necessary.
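The two approaches can be compared side by side in a small sketch. Both classes below are hypothetical and work on a made-up six-byte record (three big-endian 16-bit fields); the field names and layout are assumptions chosen for brevity and are not taken from the font parsers.

```php
<?php
// Hypothetical classes; neither is taken from the Zend_Pdf source.

// Eager style: parse() extracts everything into one associative array,
// which is then exposed through __get().
class SketchEagerParser
{
    private $data;
    private $values = array();

    public function __construct($data)
    {
        $this->data = $data;
    }

    public function parse()
    {
        // 'n' unpacks a big-endian unsigned 16-bit integer.
        $this->values = unpack(
            'nmajorVersion/nminorVersion/nrecordCount',
            $this->data
        );
    }

    public function __get($name)
    {
        if (!array_key_exists($name, $this->values)) {
            throw new OutOfBoundsException("Unknown property: $name");
        }
        return $this->values[$name];
    }
}

// Lazy style: parse() is empty and each accessor is a mini-parser that
// reads only its own field, caching the result for later calls.
class SketchLazyParser
{
    private $data;
    private $cache = array();

    public function __construct($data)
    {
        $this->data = $data;
    }

    public function parse()
    {
        // Intentionally empty; the work happens in the accessors.
    }

    public function getRecordCount()
    {
        if (!isset($this->cache['recordCount'])) {
            $field = unpack('nvalue', substr($this->data, 4, 2));
            $this->cache['recordCount'] = $field['value'];
        }
        return $this->cache['recordCount'];
    }
}

$bytes = pack('nnn', 1, 0, 42);

$eager = new SketchEagerParser($bytes);
$eager->parse();
$eagerCount = $eager->recordCount;       // served from the parsed array

$lazy = new SketchLazyParser($bytes);
$lazy->parse();
$lazyCount = $lazy->getRecordCount();    // parsed on first access
```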
Next, you will initialize a data source to be parsed and then instantiate and run the parser itself:
Once the parsing is complete, you will obtain the data you are interested in:
When complete, destroy the parser and the data source:
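The three steps above (set up the data source and parser, read the results, tear down) might look roughly like the following. This is a sketch under stated assumptions: {{SketchFileDataSource}} and {{SketchGifSizeParser}} are hypothetical stand-ins, not the Zend_Pdf classes, and the static {{$openHandles}} counter exists only to make the cleanup observable. The example reads the logical screen size from a tiny hand-built GIF header.

```php
<?php
// Hypothetical stand-ins; class names and signatures are assumptions.
class SketchFileDataSource
{
    public static $openHandles = 0;   // demo-only counter
    private $handle;

    public function __construct($path)
    {
        $this->handle = fopen($path, 'rb');
        if ($this->handle === false) {
            throw new RuntimeException("Cannot open $path");
        }
        self::$openHandles++;
    }

    public function readBytesAt($offset, $count)
    {
        fseek($this->handle, $offset);
        return fread($this->handle, $count);
    }

    // Destroying the data source releases the file handle.
    public function __destruct()
    {
        if (is_resource($this->handle)) {
            fclose($this->handle);
            self::$openHandles--;
        }
    }
}

class SketchGifSizeParser
{
    private $source;
    private $width;
    private $height;

    public function __construct(SketchFileDataSource $source)
    {
        $this->source = $source;
    }

    public function screen()
    {
        $signature = $this->source->readBytesAt(0, 6);
        if (!in_array($signature, array('GIF87a', 'GIF89a'), true)) {
            throw new RuntimeException('Not a GIF file');
        }
    }

    public function parse()
    {
        // Width and height are little-endian 16-bit values at offset 6.
        $fields = unpack('vwidth/vheight', $this->source->readBytesAt(6, 4));
        $this->width  = $fields['width'];
        $this->height = $fields['height'];
    }

    public function getWidth()
    {
        return $this->width;
    }

    public function getHeight()
    {
        return $this->height;
    }
}

// Build a throwaway input file so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'gif');
file_put_contents($path, 'GIF89a' . pack('vv', 320, 200));

// 1. Initialize the data source, then instantiate and run the parser.
$dataSource = new SketchFileDataSource($path);
$parser     = new SketchGifSizeParser($dataSource);
$parser->screen();
$parser->parse();

// 2. Obtain the parsed values.
$width  = $parser->getWidth();
$height = $parser->getHeight();

// 3. Destroy the parser first, then the data source it was holding.
unset($parser);
unset($dataSource);
unlink($path);
```

The order of the two {{unset()}} calls matters in this sketch: the parser holds a reference to the data source, so the file handle is only released once both are gone.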
Posted by Willie Alberty (willie) on 2006-06-17T15:40:10.000+0000
For a sample implementation, please refer to the following files (line numbers are as of revision 650):
Zend_Pdf_Font - Manages creation and destruction of the data source object (lines 622-625 and 671-673) and instantiation of the parser and the PDF font object itself (lines 719-721).
Zend_Pdf_FileParser_Font_OpenType_TrueType - Contains the screen() implementation (line 41).
Zend_Pdf_FileParser_Font_OpenType - Contains the bulk of the parse() implementation (line 98).
Zend_Pdf_Resource_Font_OpenType - Starts the font parser object (line 67), obtains the parsed data, and creates the font resource object based on that data.
Zend_Pdf_Resource_Font_OpenType_TrueType - Performs additional extraction from the font parser object.