Technical Concepts in PDF/A

Fundamental PDF/A requirements

PDF/A requires certain PDF features and prohibits others:

  • To guarantee the exact visual reproduction of text, all fonts used in a document must be embedded.
  • To guarantee exact color reproduction, all colors must be specified in a device-independent way.
  • Metadata must be embedded using the XMP format. The PDF/A conformance level must be recorded with specific XMP properties.
  • Encryption must not be used, so that the document's contents can always be accessed without any restriction.
  • Certain requirements for annotations and form fields ensure that the visualization is fixed and that screen and print representation are identical.
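
The XMP identification mentioned in the list above can be sketched as follows. This is a minimal illustration, not a complete or validated XMP packet: the helper name `pdfa_identification_xmp` is hypothetical, and a real document would carry further metadata in the same packet. The `pdfaid` namespace with its `part` and `conformance` properties is what PDF/A uses to record the conformance level.

```python
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def pdfa_identification_xmp(part: int, conformance: str) -> str:
    """Return a bare XMP fragment recording the PDF/A part and level.

    Hypothetical helper for illustration; only parts 1-3 and the
    levels A, B and U discussed in this document are accepted.
    """
    conformance = conformance.upper()
    if part not in (1, 2, 3):
        raise ValueError(f"unsupported PDF/A part: {part}")
    # Level U exists only in PDF/A-2 and PDF/A-3.
    allowed = {"A", "B"} if part == 1 else {"A", "B", "U"}
    if conformance not in allowed:
        raise ValueError(f"PDF/A-{part} has no conformance level {conformance}")
    return (
        '<x:xmpmeta xmlns:x="adobe:ns:meta/">\n'
        f' <rdf:RDF xmlns:rdf="{RDF_NS}">\n'
        f'  <rdf:Description rdf:about="" xmlns:pdfaid="{PDFAID_NS}">\n'
        f"   <pdfaid:part>{part}</pdfaid:part>\n"
        f"   <pdfaid:conformance>{conformance}</pdfaid:conformance>\n"
        "  </rdf:Description>\n"
        " </rdf:RDF>\n"
        "</x:xmpmeta>"
    )
```

For example, `pdfa_identification_xmp(2, "u")` yields a fragment identifying the document as PDF/A-2u, while `pdfa_identification_xmp(1, "U")` is rejected because PDF/A-1 does not define level U.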

In addition to these straightforward requirements, PDF/A requires various other, more subtle PDF features (e.g. certain entries in font data structures) and prohibits some critical structures (e.g. certain combinations of TrueType fonts and encodings). Software developers must implement and check many such aspects before they arrive at fully standard-conforming PDF/A products. PDF/A is much more than simply »PDF with embedded fonts and no encryption«!

Additional restrictions in PDF/A-1

PDF/A-1 suffers somewhat from being the first member of the PDF/A family: the standard was created at a time when important PDF concepts were not yet mature. As a result, the following features are prohibited in PDF/A-1, but are allowed in the newer parts PDF/A-2 and PDF/A-3:

  • All features which require PDF 1.5 or above, e.g. JPEG 2000 compression and layers (optional content).
  • Transparency: although transparency is possible in PDF 1.4, it was not considered suitable for archiving purposes at the time because no consistent description and implementation of transparency support was available. Since identical behavior in all PDF viewers could not be guaranteed, transparency was banned completely from PDF/A-1. After the publication of PDF/A-1 the exact semantics of PDF transparency were clarified and standardized in ISO 32000-1; the later parts PDF/A-2 and PDF/A-3 therefore allow the use of transparency.
  • File attachments were banned from PDF/A-1 to make sure that all document contents are fully archivable. While PDF/A-2 allows file attachments, it restricts them to PDF/A-1 or PDF/A-2 files to make sure that attached files can also be reproduced faithfully. PDF/A-3 further relaxes this rule to allow arbitrary file types as attachments.
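
The differences between the parts listed above can be expressed as a small rule table. The following is only a sketch of the rules as described in this section, not a validator: the function names and the feature labels (`"jpeg2000"`, `"optional_content"`, `"transparency"`) are hypothetical.

```python
from typing import Optional

# Features named in the list above as prohibited in PDF/A-1 but allowed
# in PDF/A-2 and PDF/A-3. The labels are illustrative only.
NOT_IN_PDFA1 = {"jpeg2000", "optional_content", "transparency"}


def _check_part(part: int) -> None:
    if part not in (1, 2, 3):
        raise ValueError(f"unsupported PDF/A part: {part}")


def feature_allowed(part: int, feature: str) -> bool:
    """Tell whether one of the features listed above may be used."""
    _check_part(part)
    if feature not in NOT_IN_PDFA1:
        raise ValueError(f"feature not covered by this sketch: {feature}")
    return part != 1


def attachment_allowed(part: int, attachment_pdfa_part: Optional[int]) -> bool:
    """Tell whether a file may be attached to a PDF/A document.

    attachment_pdfa_part is the PDF/A part the attached file conforms
    to, or None if the attachment is not a PDF/A file at all.
    """
    _check_part(part)
    if part == 1:
        return False                          # no attachments at all
    if part == 2:
        return attachment_pdfa_part in (1, 2)  # PDF/A-1 or PDF/A-2 only
    return True                               # PDF/A-3: arbitrary files
```

With these rules, `attachment_allowed(2, None)` is false (for example, a spreadsheet attached to a PDF/A-2 document), whereas `attachment_allowed(3, None)` is true.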

Device-independent color specification

In order to ensure consistent color reproduction across output devices and time, PDF/A requires the use of device-independent color, usually achieved via ICC profiles or CIE Lab color specifications. The optional output intent describes the color characteristics of the document. While these concepts are widely used in the graphic arts industry, enterprise PDF developers are not necessarily familiar with color management and must familiarize themselves with ICC profiles and related concepts.

Raster images, e.g. TIFF and JPEG, play a vital role in document creation. Scanned paper documents and photographs from digital cameras are common examples of raster image data in document workflows. In many cases raster image data in modern workflows is already device-independent, usually by means of an embedded ICC color profile or standardized color spaces such as sRGB. Such images are ready for use in PDF/A. However, legacy image data is in many cases device-dependent, such as black-and-white or RGB scans without any associated ICC profile.
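
To decide whether an image is ready for PDF/A, a workflow typically has to look at the embedded ICC profile, if there is one. The following sketch reads the basic fields from the fixed 128-byte ICC profile header. It assumes that the raw profile bytes have already been extracted from the image file (how this is done depends on the image format and library and is outside the sketch); the names `IccHeader` and `parse_icc_header` are hypothetical.

```python
import struct
from typing import NamedTuple


class IccHeader(NamedTuple):
    size: int          # declared profile size in bytes
    device_class: str  # e.g. "mntr" (display), "scnr" (input), "prtr" (output)
    color_space: str   # e.g. "RGB", "GRAY", "CMYK", "Lab"


def parse_icc_header(profile: bytes) -> IccHeader:
    """Extract basic fields from the 128-byte ICC profile header."""
    if len(profile) < 128:
        raise ValueError("an ICC profile header is 128 bytes long")
    # Bytes 36-39 hold the profile file signature 'acsp'.
    if profile[36:40] != b"acsp":
        raise ValueError("missing 'acsp' signature, not an ICC profile")
    (size,) = struct.unpack(">I", profile[0:4])            # big-endian uint32
    device_class = profile[12:16].decode("ascii")          # profile/device class
    color_space = profile[16:20].decode("ascii").strip()   # data color space
    return IccHeader(size, device_class, color_space)
```

An image without any profile, or with a profile whose color space does not match the image data, would need to be treated as device-dependent legacy data.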

XMP metadata and extension schemas

Extensible Metadata Platform (XMP) is an XML-based format modeled after W3C’s RDF (Resource Description Framework), which forms the foundation of the Semantic Web initiative. XMP was standardized in 2012 as ISO 16684-1. PDF/A mandates the use of XMP metadata for storing information about a document inside the PDF itself. XMP provides a powerful and flexible framework for storing standard and custom metadata properties (see our separate whitepaper on XMP).

The XMP specification includes more than a dozen predefined schemas with hundreds of properties for common document and image characteristics. The most widely used predefined XMP schema is called the Dublin Core. It includes properties such as Title, Creator, Subject, and Description.

XMP is extensible by its very nature, i.e. company- or industry-specific metadata requirements can be met by constructing custom schemas. PDF/A supports this concept. However, in order to ensure that custom metadata can be retrieved automatically, PDF/A mandates that a machine-readable description of the custom metadata be embedded in the document. This is achieved with an »XMP extension schema description«: a standardized part of the XMP metadata describes the structure of the custom XMP metadata properties.
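
As an illustration of what such a description looks like, the following sketch generates an extension schema description for a custom schema. The element and namespace names (`pdfaExtension`, `pdfaSchema`, `pdfaProperty`) follow the container defined for PDF/A as we recall it and should be checked against the standard before use; the function name and the example schema passed to it are hypothetical.

```python
from xml.sax.saxutils import escape


def extension_schema_description(schema_name, namespace_uri, prefix, properties):
    """Build an rdf:Description describing one custom XMP schema.

    properties is a list of (name, value_type, category, description)
    tuples; category is "external" or "internal".
    """
    items = "".join(
        '      <rdf:li rdf:parseType="Resource">\n'
        f"       <pdfaProperty:name>{escape(name)}</pdfaProperty:name>\n"
        f"       <pdfaProperty:valueType>{escape(value_type)}</pdfaProperty:valueType>\n"
        f"       <pdfaProperty:category>{escape(category)}</pdfaProperty:category>\n"
        f"       <pdfaProperty:description>{escape(text)}</pdfaProperty:description>\n"
        "      </rdf:li>\n"
        for name, value_type, category, text in properties
    )
    return (
        '<rdf:Description rdf:about=""\n'
        '  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n'
        '  xmlns:pdfaExtension="http://www.aiim.org/pdfa/ns/extension/"\n'
        '  xmlns:pdfaSchema="http://www.aiim.org/pdfa/ns/schema#"\n'
        '  xmlns:pdfaProperty="http://www.aiim.org/pdfa/ns/property#">\n'
        " <pdfaExtension:schemas>\n"
        "  <rdf:Bag>\n"
        '   <rdf:li rdf:parseType="Resource">\n'
        f"    <pdfaSchema:schema>{escape(schema_name)}</pdfaSchema:schema>\n"
        f"    <pdfaSchema:namespaceURI>{escape(namespace_uri)}</pdfaSchema:namespaceURI>\n"
        f"    <pdfaSchema:prefix>{escape(prefix)}</pdfaSchema:prefix>\n"
        "    <pdfaSchema:property>\n"
        "     <rdf:Seq>\n"
        f"{items}"
        "     </rdf:Seq>\n"
        "    </pdfaSchema:property>\n"
        "   </rdf:li>\n"
        "  </rdf:Bag>\n"
        " </pdfaExtension:schemas>\n"
        "</rdf:Description>"
    )
```

A call such as `extension_schema_description("Invoice schema", "http://example.com/ns/invoice/", "inv", [("number", "Text", "external", "Invoice number")])` describes a made-up schema with a single text property.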

Level A conformance: Tagged PDF

PDF/A-1a, PDF/A-2a and PDF/A-3a require the use of Tagged PDF. While plain PDF only places visible content on a page, Tagged PDF requires that the document’s logical structure is recorded within the structure hierarchy. Tagged PDF offers predefined structure element types for common parts of a document such as headings, tables, and lists. So-called marked content items can be considered the equivalent of tagged content in markup languages. They refer to elements in this structure tree. Similar to HTML and XML, Tagged PDF supports attributes for structure elements. For example, table elements can carry attributes regarding the row or column spanning properties of table cells.

Level A conformance also requires that all text in the document has Unicode semantics available (see below) and that logical words are separated by space characters.
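
The relationship between structure elements, attributes and marked content can be pictured with a small conceptual model. This is not PDF syntax and not the API of any library; the classes `StructElem` and `MarkedContent` and the function `logical_text` are hypothetical and only mirror the concepts described above. The element types and the `ColSpan` attribute are standard Tagged PDF names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class MarkedContent:
    """A piece of page content referenced from the structure tree."""
    page: int
    mcid: int   # marked content identifier, unique within its page
    text: str


@dataclass
class StructElem:
    """A node in the logical structure hierarchy."""
    type: str   # predefined type such as "H1", "Table", "TR", "TH"
    attributes: Dict[str, object] = field(default_factory=dict)
    kids: List[Union["StructElem", MarkedContent]] = field(default_factory=list)


def logical_text(elem: StructElem) -> str:
    """Collect text in logical order, separating items with spaces."""
    parts = []
    for kid in elem.kids:
        if isinstance(kid, MarkedContent):
            parts.append(kid.text)
        else:
            parts.append(logical_text(kid))
    return " ".join(p for p in parts if p)


# A heading followed by a table whose header cell spans two columns.
document = StructElem("Document", kids=[
    StructElem("H1", kids=[MarkedContent(1, 0, "Quarterly report")]),
    StructElem("Table", kids=[
        StructElem("TR", kids=[
            StructElem("TH", {"ColSpan": 2}, [MarkedContent(1, 1, "Revenue")]),
        ]),
    ]),
])
```

Walking the tree with `logical_text(document)` returns the text in reading order regardless of where the pieces are placed on the page.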

PDF/UA-1 (Universal Accessibility) is a newer standard which clarifies many aspects of Tagged PDF. It was published in 2012 as ISO 14289-1. Although there is no direct relationship between the two standards, a PDF/A document can at the same time conform to PDF/UA. In fact, if you want to create PDF/A with conformance level A we recommend adhering to the PDF/UA requirements in order to improve accessibility.

For more information please refer to the PDF/UA Whitepaper. The following caveat applies to combined PDF/A and PDF/UA documents: since the Scope table attribute is required in PDF/UA, but is not available in PDF 1.4 and therefore not in PDF/A-1, proper tagging of tables with header cells is not possible. We recommend avoiding PDF/A-1a if tables are involved, and working with the newer PDF/A-2a or PDF/A-3a standards instead.

Level U conformance: Unicode requirements

PDF/A-2 and PDF/A-3 offer level U conformance in addition to levels A and B. Level U requires proper Unicode semantics for all text in the document, but does not mandate Tagged PDF. This requirement is rooted in the fact that PDF supports a variety of font and encoding techniques, not all of which support Unicode. For example, PDF supports PostScript Type 1 fonts, which were introduced in the 1980s, while the Unicode Consortium started its work in 1991. PDF/A conformance levels A and U require supplementary Unicode mapping information for fonts which do not contain it internally. But not all Unicode values are acceptable: values in the Private Use Area (PUA) are not allowed since they do not carry any common interpretation (semantics).
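
The check implied by this requirement can be sketched as follows. The code assumes that the font's Unicode mapping is already available as a dictionary from glyph codes to Unicode strings; how that mapping is obtained from a font or PDF is outside the sketch, and the function names are hypothetical. The Private Use ranges are those defined by Unicode.

```python
def is_private_use(code_point: int) -> bool:
    """Tell whether a code point lies in one of the Unicode Private Use Areas."""
    return (
        0xE000 <= code_point <= 0xF8FF          # PUA in the Basic Multilingual Plane
        or 0xF0000 <= code_point <= 0xFFFFD     # Supplementary Private Use Area-A
        or 0x100000 <= code_point <= 0x10FFFD   # Supplementary Private Use Area-B
    )


def check_unicode_mapping(glyph_codes, to_unicode):
    """Return {code: reason} for all codes without acceptable Unicode values.

    glyph_codes: codes actually used in the document's text.
    to_unicode:  dict mapping a code to its Unicode string.
    """
    problems = {}
    for code in glyph_codes:
        text = to_unicode.get(code)
        if not text:
            problems[code] = "no Unicode mapping"
        elif any(is_private_use(ord(ch)) for ch in text):
            problems[code] = "maps to Private Use Area"
    return problems
```

An empty result means that every code used has a Unicode value outside the PUA; any reported code needs supplementary mapping information or, for custom symbols, substitute text.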

Symbolic fonts, e.g. fonts containing logos or pictograms, are an important area where this PDF/A requirement applies. Since standardized Unicode values are not available for custom symbolic glyphs, suitable Unicode semantics must be provided in an »ActualText« marked content attribute for the text. While this attribute is commonly used only in Tagged PDF, it can also be supplied in untagged documents, and this is what level U conformance requires. The ActualText may be assigned to an individual glyph or to a sequence of multiple glyphs. It may consist of an arbitrary Unicode string.

As an example, code 0x37 in the common Wingdings font contains an image of a computer keyboard with the glyph name keyboard and the PUA Unicode value U+F037. For lack of better substitute text the glyph name could be used to construct suitable ActualText, e.g. »symbol for keyboard«. Note that programmatically constructed ActualText is only a makeshift solution; human-selected text is always preferable to machine-generated ActualText.
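
The makeshift construction described above can be sketched in a few lines. The function names are hypothetical, and the handling of camel-case or separator characters in glyph names is our own assumption about how such names might be made readable; human-supplied text always takes precedence.

```python
import re
from typing import Optional


def fallback_actual_text(glyph_name: str) -> str:
    """Derive substitute text from a glyph name (makeshift solution)."""
    # Split camel-case names such as "telephoneBlack" into separate words.
    words = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", glyph_name)
    # Treat underscores, hyphens and dots as word separators.
    words = re.sub(r"[_\-.]+", " ", words).strip().lower()
    if not words:
        raise ValueError("glyph name yields no usable text")
    return f"symbol for {words}"


def choose_actual_text(glyph_name: str, human_text: Optional[str] = None) -> str:
    """Prefer human-selected text; fall back to the glyph name otherwise."""
    if human_text:
        return human_text
    return fallback_actual_text(glyph_name)
```

For the glyph named keyboard, `choose_actual_text("keyboard")` produces »symbol for keyboard«, while `choose_actual_text("keyboard", "keyboard shortcut")` keeps the text chosen by a person.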