Page by Page Processing of PDF Documents

Post by **Devteam** » Fri Nov 21 2008

Background information

A PDF document typically consists of a collection of objects, each having a numeric identifier. To locate an object, the PDF file contains a cross-reference table that gives, for each object, its location within the file. The cross-reference table is placed at the end of the PDF document, which means that the file has to be completely generated before an application can process it.
Each object in a PDF document starts with:
N r obj
where N is the ID of the object and r is a revision number (called the generation number in the PDF specification), which is always 0 in our case. An object always ends with:
endobj
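
As a rough illustration of the object delimiters described above, the following Python sketch scans a byte buffer for complete objects. The function name `find_objects` and the sample data are made up for this example. The scan is deliberately naive: it assumes the text `endobj` never appears inside an object's own data, which a real parser cannot assume.

```python
import re

# Matches an object header such as b"12 0 obj": object ID, revision number, keyword.
OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")

def find_objects(data: bytes):
    """Return (object_id, revision, start_offset, end_offset) for each complete object.

    Hypothetical helper for illustration only; it does not handle binary
    streams that happen to contain the bytes b"endobj".
    """
    found = []
    pos = 0
    while True:
        header = OBJ_HEADER.search(data, pos)
        if header is None:
            break
        end = data.find(b"endobj", header.end())
        if end == -1:
            break  # the object is not complete yet (e.g. still being received)
        end += len(b"endobj")
        found.append((int(header.group(1)), int(header.group(2)), header.start(), end))
        pos = end
    return found

# Two complete objects followed by a third one that is cut off.
sample = b"1 0 obj\n<< /Type /Page >>\nendobj\n2 0 obj\n(Hello)\nendobj\n3 0 obj\n<< /Len"
objects = find_objects(sample)
```

The truncated third object is not reported, which mirrors the situation where an object has not yet been fully received.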

Each page in a document consists of a number of objects, mainly:

A page description object, which contains things like paper size, rotation and the identification of all other objects used by that page (its resources).
A page contents object, which contains the instructions to render the page on a screen or printer.
A number of resources, such as images and fonts, used to render the page. Each resource can be made of one or more objects.

The resources used by a page can be shared by multiple pages and can be located anywhere in the document, i.e. the ordering of the objects is left to the developer’s choice.
In order to process a page, all of its objects need to be located by the client application. If one object is missing (e.g. not yet received), processing will fail.

Solution implemented by Amyuni for page by page processing
The first part of the solution is to reorder the objects in such a way that a page is immediately followed by the resources it uses. If a resource has already been transmitted, it is not transmitted a second time. Because of this reordering, a page can be processed in its entirety without waiting for the whole document to finish saving.
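
The reordering rule can be sketched in a few lines of Python. This is only an illustration of the idea and not Amyuni's implementation: the function name `emission_order`, the input layout (a list of page IDs with their resource IDs) and the `"end_page"` marker are all assumptions made for this example.

```python
def emission_order(pages):
    """Compute the order in which objects are written.

    pages: list of (page_object_id, [resource_object_ids]) in page order.
    Returns a list of ("object", id) and ("end_page", page_object_id) entries.
    Hypothetical helper for illustration only.
    """
    emitted = set()   # resources that have already been transmitted
    order = []
    for page_id, resource_ids in pages:
        order.append(("object", page_id))
        for res_id in resource_ids:
            if res_id not in emitted:   # shared resources are written once only
                emitted.add(res_id)
                order.append(("object", res_id))
        # Everything this page needs has now been written.
        order.append(("end_page", page_id))
    return order

# Page 1 uses resources 10 and 11; page 2 reuses 11 and adds 12.
order = emission_order([(1, [10, 11]), (2, [11, 12])])
```

In this example resource 11 is written after page 1 and is not repeated after page 2, so the second page only adds object 12.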

The second part of the solution is to notify the processing application each time a new object is started (StartObject event), each time an object has ended (EndObject event) and each time all the objects needed to process a page have been output (EndPage event).

Page by page processing using Amyuni PDF Creator or any external tool
1) The tool should not rely on having a cross-reference table but should build one dynamically in memory. Each time a StartObject event is received, the current position in the stream or file is saved to the dynamic cross-reference table and the object number is increased by 1.
2) When the EndPage event is received, the tool knows that it has all the objects needed to process the page and can start processing that page as it would process any regular PDF file.
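
The two steps above can be sketched as a small event handler in Python. The class `PageByPageReader`, its method names and the `process_page` callback are hypothetical and do not correspond to an actual Amyuni API. The sketch also assumes that object numbers start at 1, which the text above does not state.

```python
class PageByPageReader:
    """Builds a cross-reference table in memory from object events.

    Hypothetical sketch: the method names mirror the StartObject, EndObject
    and EndPage events but are not part of any real library.
    """

    def __init__(self, process_page):
        self.xref = {}            # object number -> [start_offset, end_offset]
        self.next_object = 1      # assumption: numbering starts at 1
        self.current = None       # object currently being written, if any
        self.pages_done = 0
        self.process_page = process_page

    def on_start_object(self, position):
        # Save the current stream position, then advance the object number.
        self.current = self.next_object
        self.xref[self.current] = [position, None]
        self.next_object += 1

    def on_end_object(self, position):
        self.xref[self.current][1] = position
        self.current = None

    def on_end_page(self):
        # All objects needed by this page are available: hand over a snapshot
        # of the completed entries so the page can be processed.
        self.pages_done += 1
        complete = {n: (s, e) for n, (s, e) in self.xref.items() if e is not None}
        self.process_page(self.pages_done, complete)
```

A caller would feed the handler with the stream positions reported by each event and receive one callback per page, for example `PageByPageReader(lambda page, table: print(page, table))`.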

Other notes
Fonts
Special care should be taken with font handling. To reduce the file size, most applications do not embed the full font into a PDF but embed only the characters that are actually used by the document. This means, however, that all the pages must have been generated and examined before the font object is output, in order to determine all the characters used by the document. When processing page by page, we cannot wait until the document is finished. The font has to be output right after the first page that uses it, which means one of three things:
1) The font is located on the system where processing occurs: There is no need to embed the font data and page by page processing creates no issues.
2) The font is not located on the end system: It has to be fully embedded into the PDF after the very first page that uses it, otherwise the page cannot be processed. Once the font is fully embedded, there is no need to output it a second time; all pages can use the same font.
3) The font is not located on the end system: Every page can have its own subset of a given font, i.e. we can have multiple Arial fonts in the same file, each page having its own subset of the Arial font. This option typically generates larger documents than option 2 and creates files that are slow to process.
All three options are possible in the case of the Amyuni printer driver. Option 3 is, however, not recommended.
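
The choice between the three options can be summarized as a small decision function. This is a sketch for illustration only: the function name, its parameters and the returned labels are invented here and are not settings of the Amyuni printer driver.

```python
def choose_font_strategy(font_on_target_system: bool,
                         allow_per_page_subsets: bool = False) -> str:
    """Pick how a font is output for page by page processing.

    Hypothetical helper: the labels returned are made up for this example.
    """
    if font_on_target_system:
        return "reference_only"    # option 1: no font data needs to be embedded
    if allow_per_page_subsets:
        return "subset_per_page"   # option 3: possible but not recommended
    return "embed_full"            # option 2: embed once, after the first page using it
```

Per-page subsets are off by default in this sketch to reflect the recommendation against option 3.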

Images
In order to save memory, the processing application might try to delete an image once the page containing it has been processed or printed. The developer needs to be aware that the same image might be used in subsequent pages and will not be output a second time. For example, pages 1 and 11 might contain the same image, so the developer should not delete images after processing page 1, otherwise the processing of page 11 will fail.
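
One simple way to respect this constraint is to keep images until the whole document has been processed. The sketch below assumes a hypothetical `ImageCache` class; its method names are invented for this example and are not part of any Amyuni product.

```python
class ImageCache:
    """Keeps image data available until the whole document is done.

    Hypothetical sketch for illustration only.
    """

    def __init__(self):
        self._images = {}   # object number -> image data

    def store(self, object_id, data):
        self._images[object_id] = data

    def get(self, object_id):
        return self._images[object_id]

    def end_page(self, page_number):
        # Deliberately release nothing: a later page may reuse any image,
        # and the image will not be output a second time.
        pass

    def end_document(self):
        # Only now is it safe to free the images.
        self._images.clear()
```

With this approach, an image stored while processing page 1 is still available when page 11 asks for it, at the cost of holding all images in memory for the duration of the document.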