Issues

ZF-19: Binary file parser class for Zend_Pdf

Description

PDF documents may contain large binary files such as TIFF, JPEG, or PNG images and TrueType fonts. Including these files inside the PDF document requires that they first be parsed and certain data extracted from them so that an appropriate information dictionary can be constructed.

Currently, the object constructor methods for each of these binary types are written to read and deal directly with filesystem objects using the traditional fopen(), fread(), unpack() and similar functions.

This does not allow them to use data from other sources, such as in-memory images generated by GD, without first writing the data to a temporary file on disk. Additionally, many of the primitive parser functions, such as extracting a four-byte unsigned integer, must be re-implemented in each class. Finally, the robust error-handling code needed for interacting with the filesystem clutters the actual parser code, making it harder to follow.

Build an abstract file parser class for use by these objects with the following functionality:

Offer a complete library of common primitive functions via a simple public API: moveToOffset(), readInt(), readBytes(), skipBytes(), etc.

Number extraction functions must be platform-independent and allow for both big- and little-endian byte orders for all numeric types.

Provide an abstraction for the specific data source. Initially, it must provide filesystem and in-memory (binary string) sources. It must also allow for the use of any kind of 'seekable' data source, including subsets of files and possibly even database records.

Data sources must implement robust error checking.

Specific parser types (image, font, etc.) are concrete subclasses which can extend or override the base parser functionality.
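As a rough illustration of the platform-independence requirement above, the following sketch decodes an integer from raw bytes one byte at a time, so the result does not depend on the host machine's byte order. The function name {{decodeInt()}} and its parameters are hypothetical and are not part of Zend_Pdf. It assumes a 64-bit PHP build so that 4-byte unsigned values fit in a native integer.

```php
<?php
// Hypothetical helper, not part of Zend_Pdf: decodes a 1- to 4-byte integer
// from a binary string without relying on the host's native byte order.
function decodeInt(string $bytes, bool $bigEndian, bool $signed): int
{
    $size = strlen($bytes);
    if ($size < 1 || $size > 4) {
        throw new InvalidArgumentException('Only 1- to 4-byte integers are supported');
    }

    // Normalize to big-endian so the loop below handles both byte orders.
    if (!$bigEndian) {
        $bytes = strrev($bytes);
    }

    // Accumulate the value most-significant byte first.
    $value = 0;
    for ($i = 0; $i < $size; $i++) {
        $value = ($value << 8) | ord($bytes[$i]);
    }

    // Two's complement: if the top bit is set, subtract 2^(8 * size).
    if ($signed && ($value & (1 << ($size * 8 - 1)))) {
        $value -= 1 << ($size * 8);
    }

    return $value;
}

$big    = decodeInt("\x01\x02", true, false);   // 258
$little = decodeInt("\x01\x02", false, false);  // 513
$minus2 = decodeInt("\xFF\xFE", true, true);    // -2
```

Working on individual bytes avoids the machine-dependent format codes of {{unpack()}}, at the cost of a short loop per value.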

Comments

Done. Please refer to the inline API documentation for a complete reference. Here's a quick introduction:

For absolute seeking, you use {{moveToOffset()}}. Relative seeking is usually handled automatically by the read functions. That is to say, {{readBytes()}}, {{readUInt()}}, {{readInt()}}, etc. all shift the current offset forward by the number of bytes read. You can also move forward or backward relative to the current position at any time using {{skipBytes()}}.
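To make the offset behavior concrete, here is a small stand-in that mimics the seeking rules just described over an in-memory string. The class {{CursorSketch}} and its {{getOffset()}} method are hypothetical. This is not the Zend_Pdf_FileParser code, only a sketch of how the current offset is expected to move.

```php
<?php
// Hypothetical stand-in that mimics the offset behavior described above.
class CursorSketch
{
    private $data;
    private $offset = 0;

    public function __construct(string $data)
    {
        $this->data = $data;
    }

    public function getOffset(): int
    {
        return $this->offset;
    }

    // Absolute seek.
    public function moveToOffset(int $offset): void
    {
        if ($offset < 0 || $offset > strlen($this->data)) {
            throw new OutOfRangeException("Offset $offset is outside the data");
        }
        $this->offset = $offset;
    }

    // Relative seek; a negative count moves backward.
    public function skipBytes(int $count): void
    {
        $this->moveToOffset($this->offset + $count);
    }

    // Reads $count bytes and advances the offset past them.
    public function readBytes(int $count): string
    {
        if ($count < 0 || $this->offset + $count > strlen($this->data)) {
            throw new OutOfRangeException('Read past the end of the data');
        }
        $bytes = substr($this->data, $this->offset, $count);
        $this->offset += $count;
        return $bytes;
    }
}

$cursor    = new CursorSketch('CoolFile0123456789');
$signature = $cursor->readBytes(8);  // 'CoolFile', offset is now 8
$cursor->skipBytes(4);               // offset is now 12
$cursor->skipBytes(-2);              // offset is now 10
$digits    = $cursor->readBytes(3);  // '234', offset is now 13
$cursor->moveToOffset(0);            // back to the start
```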

Here's how you use the parser:

First, you'll need to create a concrete subclass of Zend_Pdf_FileParser. This subclass needs to implement two methods:

{{screen()}} - Performs a cursory check to verify that the file is in the expected format. Intended to quickly weed out obviously bogus files. For example, check the file signature bytes at the top of the file.

{{parse()}} - Does the actual parsing. The parse implementation uses the functions provided by the base class to move around the file and extract data from it. Here is an example:


public function parse()
{
    // this file uses little-endian byte ordering for numbers
    $byteOrder = Zend_Pdf_Const::BYTEORDER_LITTLEENDIAN;

    // move to the top of the file and read in the eight signature bytes
    $this->moveToOffset(0);
    $signature = $this->readBytes(8);

    // strict comparison avoids PHP's loose string/number coercion
    if ($signature !== 'CoolFile') {
        throw new Exception('Bad signature!');
    }

    // next, read in the number of tables. this is a 2-byte unsigned integer
    $tableCount = $this->readUInt(2, $byteOrder);

    // skip over a bunch of stuff you don't care about
    $this->skipBytes(24);

    // read in a series of integers of various sizes and flavors
    $unsignedIntA = $this->readUInt(1, $byteOrder);
    $unsignedIntB = $this->readUInt(2, $byteOrder);
    $unsignedIntC = $this->readUInt(4, Zend_Pdf_Const::BYTEORDER_BIGENDIAN);

    // and some signed integers
    $signedIntA = $this->readInt(1, $byteOrder);
    $signedIntB = $this->readInt(2, Zend_Pdf_Const::BYTEORDER_BIGENDIAN);
    $signedIntC = $this->readInt(4, $byteOrder);
    
    // move somewhere else in the file
    $this->moveToOffset($unsignedIntC);

    // read some raw binary data
    $data = $this->readBytes(512);

    // and so on...
}
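A matching {{screen()}} check for the fictional 'CoolFile' format in the {{parse()}} example could look something like the sketch below. It is written as a free-standing function over a seekable PHP stream so that it runs on its own. The function name {{screenCoolFile()}} and the boolean return value are assumptions for illustration, and a real subclass may well signal failure by throwing an exception instead.

```php
<?php
// Hypothetical free-standing version of a screen() check: it only looks at
// the eight signature bytes and does no further parsing.
function screenCoolFile($stream): bool
{
    rewind($stream);                 // the signature sits at the top of the file
    $signature = fread($stream, 8);  // may be shorter if the file is truncated
    return $signature === 'CoolFile';
}

// Build a small in-memory "file" to screen.
$stream = fopen('php://memory', 'r+');
fwrite($stream, 'CoolFile' . str_repeat("\0", 32));

$ok = screenCoolFile($stream);
```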

The {{parse()}} implementation should usually store the extracted information in instance variables. You will then provide appropriate accessor methods to obtain the values. You could alternatively have an empty {{parse()}} method and obtain everything lazily from accessor methods. The primitive functions provided by the base class and the data source being parsed are available for the entire life span of the object.

The approach you take will largely depend on the intent of the parser you are writing. In the case of fonts, we need virtually all of the information in the file, so it's all extracted at once into a big associative array and made available via {{__get()}}.
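The eager approach might be sketched as follows. The class {{EagerParserSketch}}, its field names, and the tiny file layout are invented for illustration and do not reflect the actual font parser. The point is only the pattern: {{parse()}} fills an associative array once and {{__get()}} exposes it read-only.

```php
<?php
// Hypothetical eager parser: everything is extracted in parse() and then
// exposed as read-only properties through __get().
class EagerParserSketch
{
    private $data;
    private $parsed = array();

    public function __construct(string $data)
    {
        $this->data = $data;
    }

    public function parse(): void
    {
        // 'v' = unsigned 16-bit little-endian, 'N' = unsigned 32-bit big-endian
        $fields = unpack('vtableCount/NdataOffset', substr($this->data, 8, 6));

        $this->parsed = array(
            'signature'  => substr($this->data, 0, 8),
            'tableCount' => $fields['tableCount'],
            'dataOffset' => $fields['dataOffset'],
        );
    }

    public function __get($name)
    {
        if (!array_key_exists($name, $this->parsed)) {
            throw new OutOfBoundsException("Unknown property: $name");
        }
        return $this->parsed[$name];
    }
}

$parser = new EagerParserSketch('CoolFile' . pack('vN', 3, 1024));
$parser->parse();
$tableCount = $parser->tableCount;  // 3
```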

If you'd rather provide a utility class that is able to extract different bits and pieces from the file, but never everything at once, you would create a bunch of accessor methods that act as mini-parsers. This way you don't incur the overhead of parsing the entire file unless it is absolutely necessary.
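A lazy accessor of that kind could look roughly like this. The class {{LazyParserSketch}}, the {{getTableCount()}} accessor, and the {{$parseCalls}} counter are hypothetical, and the counter exists only to show that the bytes are decoded once and then cached.

```php
<?php
// Hypothetical lazy parser: nothing is decoded until an accessor is called,
// and each value is cached after its first extraction.
class LazyParserSketch
{
    private $data;
    private $tableCount = null;

    // Demonstration only: counts how often the bytes are actually decoded.
    public $parseCalls = 0;

    public function __construct(string $data)
    {
        $this->data = $data;
    }

    public function getTableCount(): int
    {
        if ($this->tableCount === null) {
            $this->parseCalls++;
            // 2-byte unsigned little-endian integer right after the signature
            $fields = unpack('vcount', substr($this->data, 8, 2));
            $this->tableCount = $fields['count'];
        }
        return $this->tableCount;
    }
}

$parser = new LazyParserSketch('CoolFile' . pack('v', 3));
$first  = $parser->getTableCount();  // decodes the two bytes
$second = $parser->getTableCount();  // served from the cached value
```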

Next, you will initialize a data source to be parsed and then instantiate and run the parser itself:


// takes care of validating file paths, permissions, etc.
$dataSource = new Zend_Pdf_FileParserDataSource_File('/path/to/file');

$parser = new Zend_Pdf_FileParser_MyGreatParser($dataSource);
$parser->parse();

Once the parsing is complete, you will obtain the data you are interested in:


// these values might then be used to create other PDF objects such as
// names, dictionaries, arrays, or streams
$someValue = $parser->getSomeValue();
$anotherValue = $parser->getAnotherValue();

When you are finished, destroy the parser and the data source:


$parser = null;
$dataSource = null;

For a sample implementation, please refer to the following files (line numbers are as of revision 650):

{{Zend_Pdf_Font}} - Manages creation and destruction of the data source object (lines 622-625 and 671-673) and instantiation of the parser and the PDF font object itself (lines 719-721).

{{Zend_Pdf_FileParser_Font_OpenType_TrueType}} - Contains the {{screen()}} implementation (line 41).

{{Zend_Pdf_FileParser_Font_OpenType}} - Contains the bulk of the {{parse()}} implementation (line 98).

{{Zend_Pdf_Resource_Font_OpenType}} - Starts the font parser object (line 67), obtains the parsed data, and creates the font resource object based on that data.

{{Zend_Pdf_Resource_Font_OpenType_TrueType}} - Performs additional extraction from the font parser object.