ZF-19: Binary file parser class for Zend_Pdf

Issue Type: New Feature
Created: 2006-06-17T13:35:35.000+0000
Last Updated: 2007-07-05T14:44:27.000+0000
Status: Closed
Fix version(s): 0.1.4 (29/Jun/06)

Reporter: Willie Alberty (willie)
Assignee: Willie Alberty (willie)
Tags: Zend_Pdf

Related issues: ZF-11



PDF documents may contain large binary files such as TIFF, JPEG, or PNG images and TrueType fonts. Including these files inside the PDF document requires that they first be parsed and certain data extracted from them so that an appropriate information dictionary can be constructed.

Currently, the object constructor methods for each of these binary types are written to read and deal directly with filesystem objects using the traditional fopen(), fread(), unpack() and similar functions.

This prevents them from using data from other sources, such as in-memory images generated by GD, without first writing the data to a temporary file on disk. Additionally, many of the primitive parser functions, such as extracting a four-byte unsigned integer, must be re-implemented in each class. Finally, the robust error-handling code needed for filesystem access clutters the actual parser code, making it harder to follow.

Build an abstract file parser class for use by these objects with the following functionality:

- Offer a complete library of common primitive functions via a simple public API: moveToOffset(), readInt(), readBytes(), skipBytes(), etc.

- Number extraction functions must be platform-independent and allow for both big- and little-endian byte orders for all numeric types.

- Provide an abstraction for the specific data source. Initially must provide filesystem and in-memory (binary string) sources. Must allow for the use of any kind of 'seekable' data source, including subsets of files and possibly even database records.

- Data sources must implement robust error checking.

- Specific parser types (image, font, etc.) are concrete subclasses which can extend or override the base parser functionality.
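
To make the byte-order requirement concrete, here is a minimal sketch of platform-independent integer decoding. The constant and function names are hypothetical and are not part of Zend_Pdf; the sketch assumes integers of 1 to 4 bytes and a 64-bit PHP build for exact 4-byte results.

```php
<?php
// Hypothetical constants and helpers for illustration; not the Zend_Pdf API.
const SKETCH_BIG_ENDIAN    = 0;
const SKETCH_LITTLE_ENDIAN = 1;

/**
 * Decodes an unsigned integer from a binary string of 1 to 4 bytes.
 */
function sketchDecodeUInt($bytes, $byteOrder)
{
    $size = strlen($bytes);
    if ($size < 1 || $size > 4) {
        throw new InvalidArgumentException("Unsupported integer size: $size");
    }
    if ($byteOrder === SKETCH_LITTLE_ENDIAN) {
        // Normalize to big-endian so one loop handles both byte orders.
        $bytes = strrev($bytes);
    }
    $value = 0;
    for ($i = 0; $i < $size; $i++) {
        // Work byte by byte instead of relying on the host CPU's byte order.
        $value = $value * 256 + ord($bytes[$i]);
    }
    return $value;
}

/**
 * Decodes a two's-complement signed integer of 1 to 4 bytes.
 */
function sketchDecodeInt($bytes, $byteOrder)
{
    $value = sketchDecodeUInt($bytes, $byteOrder);
    // Smallest unsigned value that has the sign bit set.
    $limit = pow(2, 8 * strlen($bytes) - 1);
    return ($value >= $limit) ? $value - 2 * $limit : $value;
}
```

Because the bytes are combined arithmetically, the result does not depend on the machine the code runs on, which is the main reason to avoid the machine-order formats of unpack() for this job.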


Posted by Willie Alberty (willie) on 2006-06-17T13:59:58.000+0000

Done. Please refer to the inline API documentation for a complete reference. Here's a quick introduction:

For absolute seeking, you use moveToOffset(). Relative seeking is usually handled automatically by the read functions. That is to say, readBytes(), readUInt(), readInt(), etc. all shift the current offset forward by the number of bytes read. You can also move forward or backward relative to the current position at any time using skipBytes().
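
The seeking rules above can be sketched with a small in-memory source. Only the method names moveToOffset(), readBytes(), and skipBytes() come from this issue; the class name, getOffset(), and the exception types are assumptions made for illustration.

```php
<?php
// Hypothetical in-memory source; not the actual Zend_Pdf implementation.
class SketchStringSource
{
    private $data;
    private $size;
    private $offset = 0;

    public function __construct($data)
    {
        $this->data = $data;
        $this->size = strlen($data);
    }

    public function getOffset()
    {
        return $this->offset;
    }

    // Absolute seek: jump to a byte position counted from the start.
    public function moveToOffset($offset)
    {
        if ($offset < 0 || $offset > $this->size) {
            throw new OutOfRangeException("Offset $offset is outside the data");
        }
        $this->offset = $offset;
    }

    // Relative seek: a negative count moves backward.
    public function skipBytes($byteCount)
    {
        $this->moveToOffset($this->offset + $byteCount);
    }

    // Reads $byteCount bytes and advances the offset past them.
    public function readBytes($byteCount)
    {
        if ($byteCount < 0 || $this->offset + $byteCount > $this->size) {
            throw new OutOfRangeException(
                "Cannot read $byteCount bytes at offset {$this->offset}"
            );
        }
        $bytes = substr($this->data, $this->offset, $byteCount);
        $this->offset += $byteCount;
        return $bytes;
    }
}

$source = new SketchStringSource('CoolFile-payload');
$signature = $source->readBytes(8);   // offset is now 8
$source->skipBytes(1);                // offset is now 9
$rest = $source->readBytes(7);        // offset is now 16
$source->skipBytes(-7);               // back to offset 9
$source->moveToOffset(0);             // absolute seek to the start
```

Keeping the bounds checks inside the source is what lets parser code stay free of error-handling clutter.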

Here's how you use the parser:

First, you'll need to create a concrete subclass of Zend_Pdf_FileParser. This subclass needs to implement two methods:

screen() - Performs a cursory check to verify that the file is in the expected format. Intended to quickly weed out obviously bogus files. For example, check the file signature bytes at the top of the file.

parse() - Does the actual parsing. The parse implementation uses the functions provided by the base class to move around the file and extract data from it. Here is an example:

<pre class="highlight">
public function parse()
{
    // this file uses little-endian byte ordering for numbers
    $byteOrder = Zend_Pdf_Const::BYTEORDER_LITTLEENDIAN;

    // move to the top of the file and read in the eight signature bytes
    $this->moveToOffset(0);
    $signature = $this->readBytes(8);

    if ($signature !== 'CoolFile') {
        throw new Exception('Bad signature!');
    }

    // next, read in the number of tables. this is a 2-byte unsigned integer
    $tableCount = $this->readUInt(2, $byteOrder);

    // skip over a bunch of stuff you don't care about
    // (16 is an arbitrary byte count chosen for illustration)
    $this->skipBytes(16);

    // read in a series of integers of various sizes and flavors
    $unsignedIntA = $this->readUInt(1, $byteOrder);
    $unsignedIntB = $this->readUInt(2, $byteOrder);
    $unsignedIntC = $this->readUInt(4, Zend_Pdf_Const::BYTEORDER_BIGENDIAN);

    // and some signed integers
    $signedIntA = $this->readInt(1, $byteOrder);
    $signedIntB = $this->readInt(2, Zend_Pdf_Const::BYTEORDER_BIGENDIAN);
    $signedIntC = $this->readInt(4, $byteOrder);

    // move somewhere else in the file
    // (1024 is an arbitrary offset chosen for illustration)
    $this->moveToOffset(1024);

    // read some raw binary data
    $data = $this->readBytes(512);

    // and so on...
}
</pre>

The parse() implementation should usually store the extracted information in instance variables. You will then provide appropriate accessor methods to obtain the values. You could alternatively have an empty parse() method and obtain everything lazily from accessor methods. The primitive functions provided by the base class and the data source being parsed are available for the entire life span of the object.

The approach you take will largely depend on the intent of the parser you are writing. In the case of fonts, we need virtually all of the information in the file, so it's all extracted at once into a big associative array and made available via __get().

If you'd rather provide a utility class that is able to extract different bits and pieces from the file, but never everything at once, you would create a bunch of accessor methods that act as mini-parsers. This way you don't incur the overhead of parsing the entire file unless it is absolutely necessary.
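
As a rough sketch of the lazy approach, the hypothetical class below leaves parse() empty and decodes a single field on demand. The class name and the file layout (an 8-byte signature followed by a 2-byte little-endian table count) are invented for illustration, and the sketch reads from a plain binary string rather than a real data source so that it runs on its own.

```php
<?php
// Hypothetical lazy parser; the layout it assumes is invented.
class SketchLazyParser
{
    private $data;
    private $tableCount = null; // cached after the first request

    public function __construct($data)
    {
        $this->data = $data;
    }

    // Intentionally empty: nothing is extracted up front.
    public function parse()
    {
    }

    // Accessor acting as a mini-parser: decodes only the field it needs.
    public function getTableCount()
    {
        if ($this->tableCount === null) {
            if (strlen($this->data) < 10) {
                throw new RuntimeException('Data too short for table count');
            }
            // 'v' is an unsigned 16-bit little-endian integer.
            $fields = unpack('vcount', substr($this->data, 8, 2));
            $this->tableCount = $fields['count'];
        }
        return $this->tableCount;
    }
}

$parser = new SketchLazyParser("CoolFile\x03\x00rest of the file");
$parser->parse();                       // does no work
$tableCount = $parser->getTableCount(); // decodes bytes 8-9 on first call
```

Caching the decoded value means repeated calls to the accessor cost no more than the eager approach would.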

Next, you will initialize a data source to be parsed and then instantiate and run the parser itself:

<pre class="highlight">
// takes care of validating file paths, permissions, etc.
$dataSource = new Zend_Pdf_FileParserDataSource_File('/path/to/file');

$parser = new Zend_Pdf_FileParser_MyGreatParser($dataSource);
$parser->parse();
</pre>

Once the parsing is complete, you will obtain the data you are interested in:

<pre class="highlight">
// these values might then be used to create other PDF objects such as
// names, dictionaries, arrays, or streams
$someValue = $parser->getSomeValue();
$anotherValue = $parser->getAnotherValue();
</pre>

When complete, destroy the parser and the data source:

<pre class="highlight">
$parser = null;
$dataSource = null;
</pre>
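
Destroying both objects matters most for file-backed sources. The sketch below assumes the source holds an open file handle and releases it in its destructor. The class name and the specific checks are hypothetical and should not be read as the actual Zend_Pdf_FileParserDataSource_File implementation.

```php
<?php
// Hypothetical file-backed source for illustration only.
class SketchFileSource
{
    private $handle;
    private $size;

    public function __construct($path)
    {
        if (!is_file($path) || !is_readable($path)) {
            throw new RuntimeException("Cannot read file: $path");
        }
        $handle = fopen($path, 'rb');
        if ($handle === false) {
            throw new RuntimeException("Cannot open file: $path");
        }
        $this->handle = $handle;
        $this->size = filesize($path);
    }

    public function getSize()
    {
        return $this->size;
    }

    public function readBytes($byteCount)
    {
        if ($byteCount < 1) {
            throw new InvalidArgumentException('Byte count must be positive');
        }
        $bytes = fread($this->handle, $byteCount);
        if ($bytes === false || strlen($bytes) !== $byteCount) {
            throw new RuntimeException("Could not read $byteCount bytes");
        }
        return $bytes;
    }

    // Runs when the last reference goes away, e.g. after $dataSource = null.
    public function __destruct()
    {
        if (is_resource($this->handle)) {
            fclose($this->handle);
        }
    }
}

// Create a throwaway file so the sketch is runnable on its own.
$path = tempnam(sys_get_temp_dir(), 'sketch');
file_put_contents($path, 'CoolFile');

$dataSource = new SketchFileSource($path);
$signature = $dataSource->readBytes(8);
$dataSource = null; // last reference released, so the handle is closed
unlink($path);
```

Since a parser keeps its own reference to the data source, the handle is only released once both variables have been cleared.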

Posted by Willie Alberty (willie) on 2006-06-17T15:40:10.000+0000

For a sample implementation, please refer to the following files (line numbers are as of revision 650):

Zend_Pdf_Font - Manages creation and destruction of the data source object (lines 622-625 and 671-673) and instantiation of the parser and the PDF font object itself (lines 719-721).

Zend_Pdf_FileParser_Font_OpenType_TrueType - Contains the screen() implementation (line 41).

Zend_Pdf_FileParser_Font_OpenType - Contains the bulk of the parse() implementation (line 98).

Zend_Pdf_Resource_Font_OpenType - Starts the font parser object (line 67), obtains the parsed data, and creates the font resource object based on that data.

Zend_Pdf_Resource_Font_OpenType_TrueType - Additional extraction from the font parser object.
