|
Introduction
This document is the first of two that will look at some of the challenges faced by developers and non-developers who work with PDF technologies and who are curious about what causes fonts in a PDF to render incorrectly or even go missing. Specifically, these documents provide an overview of some of the problems associated with missing font information in PDFs.
The first document presents the Portable Document Format as well as industry terms and concepts related to that format. The problem of missing font information is also introduced. The second document expands on those terms and concepts and explores some of the common scenarios in which PDFs are missing some or all of their font information.
Brief Overview of PDF
The Portable Document Format was originally conceived in 1991 as the Camelot Project, by Adobe’s co-founder Dr. John Warnock. Inspired by the device independence of PostScript, Dr. Warnock wanted to develop a technology that could accurately display and print electronic documents across different operating systems, hardware, or applications. His answer was the PDF.
Unlike its predecessor, PostScript, PDF was first and foremost a file format and not a programming language. Although PDF evolved from PostScript, the primary difference is that PostScript is a true programming language and PDF is not: PDF does not contain programming constructs such as loops, conditional control flow, or variables. Rather, PDF was envisioned to go further than PostScript by being able to describe how pages behave and what type of information a document could contain. Years later, PDF would encompass complex features such as search capabilities, audio, and even video.
On July 1, 2008, PDF became an open standard published by ISO as ISO 32000-1:2008.
PDF Structure
PDFs are essentially collections of data objects organized in a hierarchical manner that describe how one or more pages in a document must be displayed. These data objects can describe a page, a resource, other objects, a sequence of operating instructions, and so on. Furthermore, a data object can reference other objects and be referenced by other objects (i.e., an object can be a parent object and a child object at the same time).
PDF documents contain four main types of objects that define their structure:
• the document catalog object
• page objects
• page content objects
• document and page resources
The document catalog object typically contains a cross-reference table and page objects. It can also contain elements such as document information, named destinations, thumbnails, and bookmarks.
Page Objects
Page objects can contain one or more content objects as well as several other types of elements such as page cropping information, hyperlinks, article threads, file annotations, form fields, digital signatures, and child pages in the document. Page objects also contain references to all the resources used by a page.
Content Objects
Content objects contain marking operators (i.e., drawing instructions) and use resources such as fonts, images, or color spaces that are needed to fully render the page.
Resource Objects
PDF defines a number of resource objects, such as fonts, images, color spaces, and patterns. Fonts are needed to render text, color spaces represent the colors used in the document, and patterns define how backgrounds are painted.
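The hierarchy described above can be sketched as a toy object store in Python. This is only an illustration, not how any particular PDF library represents a document: the `Ref` class, the object numbers, the dictionary keys chosen, and the `fonts_used_by_pages` helper are all assumptions made for this example, and the page tree is kept one level deep for brevity.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference to another object."""
    number: int


# Toy object store keyed by object number (all numbers are made up).
objects: Dict[int, Dict[str, Any]] = {
    1: {"Type": "Catalog", "Pages": Ref(2)},
    2: {"Type": "Pages", "Kids": [Ref(3)]},
    # The page is a child of object 2 and, in turn, refers to its own
    # content object and to the resources it needs.
    3: {
        "Type": "Page",
        "Parent": Ref(2),
        "Contents": Ref(4),
        "Resources": {"Font": {"F1": Ref(5)}},
    },
    4: {"Stream": "BT /F1 12 Tf (Hello) Tj ET"},
    5: {"Type": "Font", "BaseFont": "Helvetica"},
}


def resolve(value: Any) -> Any:
    """Follow a reference to the object it points at; pass other values through."""
    return objects[value.number] if isinstance(value, Ref) else value


def fonts_used_by_pages(catalog_number: int) -> List[str]:
    """Walk catalog -> page tree -> pages -> font resources."""
    page_tree = resolve(objects[catalog_number]["Pages"])
    names = []
    for kid in page_tree["Kids"]:
        page = resolve(kid)
        for font_ref in page["Resources"]["Font"].values():
            names.append(resolve(font_ref)["BaseFont"])
    return names


print(fonts_used_by_pages(1))  # ['Helvetica']
```

In this sketch a font goes "missing" for a page as soon as a reference in its resources points at an object number that is not in the store, which is the kind of broken link the rest of this document is concerned with.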
PDF Organization
PDFs are sectioned into four separate areas:
• the header
• the body
• the cross-reference table (xref)
• the trailer
The Header
The header contains a comment that identifies the nature of a PDF document and the specifications to which it adheres. For example, the comment outlined in Figure 1 indicates that the document conforms to Version 1.7 of the PDF specification.
Figure 1. Header
%PDF-1.7
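As a minimal sketch of how a reader might check the header, the Python below looks for the version comment near the start of the file. The function name `pdf_version` and the 1024-byte search window are assumptions made for this example.

```python
import re
from typing import Optional, Tuple

# "%PDF-" followed by a major.minor version number, e.g. %PDF-1.7
HEADER_RE = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_version(data: bytes, window: int = 1024) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from the header comment, or None if it is absent."""
    # Only the beginning of the file is inspected; the window size is illustrative.
    match = HEADER_RE.search(data[:window])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\n"
print(pdf_version(sample))  # (1, 7)
```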
The Body
The body of a PDF is where the content objects in the document are located. These objects include text streams, image data, fonts, annotations, and so on (see Figure 2 below). The body can also contain numerous types of invisible (non-display) objects that help implement the document's interactivity, security features, or logical structure. Each object has three essential components: a numerical identifier, a fixed position (also known as an offset), and its content.
Figure 2. Example of Objects in the body
7 0 obj
<<
/Type /Font
/Subtype /TrueType
/BaseFont /CGFGAX+TRReservedPIFont,BoldItalic
/FirstChar 32
/LastChar 35
/Widths [
220 265 187 567]
/FontDescriptor 8 0 R
>>
endobj
8 0 obj
<<
/Type /FontDescriptor
/FontName /CGFGAX+TRReservedPIFont,BoldItalic
/Flags 4
/FontBBox [ -140 -269 1027 906 ]
/Ascent 704
/Descent 269
/FontFile2 9 0 R
…
>>
endobj
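To make the "identifier plus content" idea concrete, the following Python sketch splits the font object from Figure 2 into its object number, generation number, and body, then checks that the /Widths array has one entry per character code from /FirstChar to /LastChar (that is, LastChar - FirstChar + 1 entries). It relies on regular expressions rather than a real PDF tokenizer, so it is only suitable for simple, uncompressed text like this example; the helper names are assumptions.

```python
import re
from typing import List, Tuple

FIGURE_2_OBJECT = """7 0 obj
<<
/Type /Font
/Subtype /TrueType
/BaseFont /CGFGAX+TRReservedPIFont,BoldItalic
/FirstChar 32
/LastChar 35
/Widths [
220 265 187 567]
/FontDescriptor 8 0 R
>>
endobj
"""

# "<number> <generation> obj ... endobj"
OBJECT_RE = re.compile(r"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def split_objects(text: str) -> List[Tuple[int, int, str]]:
    """Return (object number, generation, body) for each object found."""
    return [(int(n), int(g), body.strip()) for n, g, body in OBJECT_RE.findall(text)]


def int_entry(body: str, key: str) -> int:
    """Read an integer dictionary value such as /FirstChar 32."""
    match = re.search(r"/" + re.escape(key) + r"\s+(-?\d+)", body)
    if match is None:
        raise KeyError(key)
    return int(match.group(1))


def widths(body: str) -> List[float]:
    """Read the numbers inside the /Widths array."""
    match = re.search(r"/Widths\s*\[(.*?)\]", body, re.DOTALL)
    if match is None:
        raise KeyError("Widths")
    return [float(token) for token in match.group(1).split()]


def widths_match_char_range(body: str) -> bool:
    """True when /Widths has exactly LastChar - FirstChar + 1 entries."""
    expected = int_entry(body, "LastChar") - int_entry(body, "FirstChar") + 1
    return len(widths(body)) == expected


number, generation, body = split_objects(FIGURE_2_OBJECT)[0]
print(number, generation, widths_match_char_range(body))  # 7 0 True
```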
The Cross-Reference Table
The cross-reference table (see Figure 3 below) lists the byte offsets of all the objects in a PDF document. The table is divided into sections, and each section begins with a line giving the number of the first object in that section followed by the number of entries it contains. With the cross-reference table, a PDF parser has random access to objects: it can look up the offset of any object and jump directly to it without having to read the entire file.
Figure 3. XREF Table
xref
0 10
0000000000 65535 f
0000000017 00000 n
0000000067 00000 n
0000001244 00000 n
0000001264 00000 n
0000001370 00000 n
0000002027 00000 n
0000009301 00000 n
0000009321 00000 n
0000009424 00000 n
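Each entry in Figure 3 has three fields: a 10-digit byte offset, a 5-digit generation number, and a keyword that is `n` for an object in use or `f` for a free entry. A minimal Python sketch of reading such entries follows; the `XrefEntry` type and the `parse_xref_entries` function are names invented for this example, and the caller is assumed to supply the number of the first object.

```python
from typing import List, NamedTuple


class XrefEntry(NamedTuple):
    object_number: int
    offset: int       # byte offset of the object from the start of the file
    generation: int
    in_use: bool      # True for "n" entries, False for "f" (free) entries


def parse_xref_entries(first_object: int, lines: List[str]) -> List[XrefEntry]:
    """Turn entry lines into XrefEntry records, numbering from first_object."""
    entries = []
    for index, line in enumerate(lines):
        offset, generation, keyword = line.split()
        entries.append(
            XrefEntry(first_object + index, int(offset), int(generation), keyword == "n")
        )
    return entries


# The first three entries from Figure 3.
sample = [
    "0000000000 65535 f",
    "0000000017 00000 n",
    "0000000067 00000 n",
]
for entry in parse_xref_entries(0, sample):
    print(entry)
```

With the entries parsed, a reader can seek straight to `entry.offset` in the file to load a given object, which is the random access the table exists to provide.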
The Trailer
Even though the trailer is technically the end of a PDF document, it is the first entry point that applications use to access the essential components of a PDF. The trailer contains pointers that parsers and applications use to locate the cross-reference table and other important objects in a PDF.
Examples of important objects include the root object (the document catalog, from which the page tree is reached) and the info object (which contains document metadata such as the title and author).
Figure 4. Trailer
trailer
<<
/Size 101
/Root 4 0 R
/Info 99 0 R
/ID [<50947B130F0E7443397A><50947B130F0E057443397A>]
>>
startxref
1052783
%%EOF
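Because the trailer sits at the end of the file, a reader typically starts from the last bytes and works backward. The Python sketch below shows one way to do that with regular expressions: it reads the number after `startxref` (the byte offset of the cross-reference table) and the /Root reference. The function names and the 1024-byte tail window are assumptions for this example, and the sample bytes are a shortened stand-in for a real file.

```python
import re
from typing import Tuple


def find_startxref(data: bytes, window: int = 1024) -> int:
    """Return the cross-reference table offset stored after 'startxref'."""
    tail = data[-window:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref found near the end of the file")
    # If the pattern occurs more than once in the tail, the last one wins.
    return int(matches[-1])


def find_root(data: bytes, window: int = 1024) -> Tuple[int, int]:
    """Return (object number, generation) of the /Root entry in the trailer."""
    match = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", data[-window:])
    if match is None:
        raise ValueError("no /Root entry found near the end of the file")
    return int(match.group(1)), int(match.group(2))


sample = (
    b"%PDF-1.7\n"
    b"trailer\n<<\n/Size 101\n/Root 4 0 R\n/Info 99 0 R\n>>\n"
    b"startxref\n1052783\n%%EOF\n"
)
print(find_startxref(sample))  # 1052783
print(find_root(sample))       # (4, 0)
```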
|