Technical Concepts in PDF/A
Fundamental PDF/A requirements
PDF/A requires certain PDF features and prohibits others:
- To guarantee the exact visual reproduction of text, all fonts used in a document must be embedded. The only exception is fonts used for invisible text, which need not be embedded.
- To guarantee exact color reproduction, all colors must be specified in a device-independent way.
- Metadata must be embedded using the XMP format. The PDF/A conformance level must be recorded with specific XMP properties. While PDF/A-1/2/3 impose strict requirements on custom metadata properties, this has been relaxed in PDF/A-4.
- Encryption is not allowed, so that the document contents can always be accessed without restriction.
- Certain requirements for annotations and form fields ensure that the visualization is fixed and that screen and print representation are identical.
In addition to these straightforward requirements, PDF/A requires various other, more subtle PDF features (e.g. certain entries in font data structures) and prohibits some critical structures, e.g. certain combinations of TrueType fonts and encodings whose rendering results are not guaranteed. Software developers must implement and check many such aspects before they arrive at fully standard-conforming PDF/A products. PDF/A is much more than simply »PDF with embedded fonts and no encryption«.
Specific restrictions in PDF/A-1
PDF/A-1 reflects the fact that it was the first in the PDF/A family: the standard was created at a time when important PDF concepts were not yet ready for prime time. As a result, the following features are prohibited in PDF/A-1, but are allowed in the newer parts:
- All features which require PDF 1.5 or above, e.g. JPEG 2000 compression and layers (optional content).
- Transparency: although transparency is possible in PDF 1.4, it was not considered suitable for archiving at the time because no consistent description of transparency support was available. Since identical behavior in all PDF viewers could not be guaranteed, transparency was banned completely from PDF/A-1. After the publication of PDF/A-1 the exact semantics of PDF transparency were clarified and standardized in ISO 32000-1; the later parts therefore allow the use of transparency.
- File attachments were banned from PDF/A-1 to make sure that all document contents are fully archivable.
Device-independent color specification
In order to ensure consistent color reproduction across output devices and time, PDF/A requires the use of device-independent color, usually achieved via ICC color profiles or CIE Lab color specifications. The optional output intent describes the color characteristics of the document with an ICC profile. While these concepts are widely used in the graphic arts industry, enterprise PDF developers are not necessarily familiar with color management and must familiarize themselves with ICC profiles and related concepts.
Raster images, e.g. TIFF and JPEG, play a vital role in document creation. Scanned paper documents and photographs from digital cameras are common examples of raster image data in document workflows. Often raster image data is already device-independent, usually by means of an embedded ICC color profile or standardized color spaces such as sRGB. Such images are ready for use in PDF/A. However, legacy image data is in many cases device-dependent, such as black-and-white or RGB scans without an associated ICC profile.
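The distinction between device-dependent and device-independent color can be sketched with the color space names used in the PDF specification. The following is a minimal illustration, not a validator: the function names are hypothetical, and whether a device-dependent space is acceptable in a particular file depends on further rules (for example the output intent) which this sketch does not model.

```python
# Color space families as named in the PDF specification.
DEVICE_DEPENDENT = frozenset({"DeviceGray", "DeviceRGB", "DeviceCMYK"})
DEVICE_INDEPENDENT = frozenset({"CalGray", "CalRGB", "Lab", "ICCBased"})


def classify_color_space(name):
    """Classify a PDF color space name (hypothetical helper)."""
    if name in DEVICE_INDEPENDENT:
        return "device-independent"
    if name in DEVICE_DEPENDENT:
        return "device-dependent"
    # Indexed, Separation, DeviceN and Pattern build on another color
    # space, so their status cannot be decided from the name alone.
    return "depends on underlying color space"


def device_dependent_spaces(used):
    """Return the sorted, de-duplicated device-dependent spaces in `used`."""
    return sorted(cs for cs in set(used) if cs in DEVICE_DEPENDENT)


# A legacy RGB scan without an ICC profile would typically show up as DeviceRGB.
print(device_dependent_spaces(["ICCBased", "DeviceRGB", "Lab"]))
```

A report like this only points at the objects which need attention, e.g. images to which a suitable ICC profile must be assigned before they are ready for PDF/A.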
XMP metadata and extension schemas in PDF/A-1/2/3
Extensible Metadata Platform (XMP) is an XML-based format modeled after W3C’s RDF (Resource Description Framework), which forms the foundation of the Semantic Web initiative. In 2012 XMP was standardized as ISO 16684-1. PDF/A mandates the use of XMP metadata for storing information about a document inside the PDF itself. XMP provides a powerful and flexible framework for storing standard and custom metadata properties.
The XMP specification includes more than a dozen predefined schemas with hundreds of properties for common document and image characteristics. The most widely used predefined XMP schema is called the Dublin Core. It includes properties such as Title, Creator, Subject, and Description.
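To make the structure more concrete, the following sketch builds a small XMP document containing a Dublin Core title together with the PDF/A identification properties pdfaid:part and pdfaid:conformance. It is an illustration under assumptions: the function name build_xmp is hypothetical, the xpacket wrapper which normally surrounds XMP is omitted, and a real PDF/A document usually carries more properties than shown here.

```python
import xml.etree.ElementTree as ET

NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "pdfaid": "http://www.aiim.org/pdfa/ns/id/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_xmp(title, part, conformance):
    """Return XMP (without xpacket wrapper) as a string (hypothetical helper)."""
    for prefix, uri in NS.items():
        ET.register_namespace(prefix, uri)

    xmpmeta = ET.Element(f"{{{NS['x']}}}xmpmeta")
    rdf = ET.SubElement(xmpmeta, f"{{{NS['rdf']}}}RDF")
    desc = ET.SubElement(rdf, f"{{{NS['rdf']}}}Description")
    desc.set(f"{{{NS['rdf']}}}about", "")

    # PDF/A identification: part number and conformance level.
    ET.SubElement(desc, f"{{{NS['pdfaid']}}}part").text = str(part)
    ET.SubElement(desc, f"{{{NS['pdfaid']}}}conformance").text = conformance

    # dc:title is a language alternative, i.e. an rdf:Alt of rdf:li items.
    title_el = ET.SubElement(desc, f"{{{NS['dc']}}}title")
    alt = ET.SubElement(title_el, f"{{{NS['rdf']}}}Alt")
    li = ET.SubElement(alt, f"{{{NS['rdf']}}}li")
    li.set(XML_LANG, "x-default")
    li.text = title

    return ET.tostring(xmpmeta, encoding="unicode")


print(build_xmp("Annual report", 2, "B"))
```

In practice the metadata is normally written by the PDF library in use; the sketch only shows where the identification properties sit relative to a predefined schema such as Dublin Core.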
XMP is extensible by nature, i.e. company- or industry-specific metadata requirements can be addressed with custom schemas. PDF/A supports this concept. However, to make sure that custom properties can be retrieved automatically, PDF/A mandates that a machine-readable description of all custom properties is included alongside them. This is achieved with an »XMP extension schema description«: a standardized part of the XMP metadata describes the structure of the custom XMP metadata properties.
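The underlying idea, that every custom property in use must be covered by a description, can be sketched as a simple coverage check. This is a conceptual illustration only: the data layout (a dictionary from namespace URI to described property names), the function name, the example namespace and the short list of predefined namespaces are all assumptions made for the example, and a real validator would also check value types and further details of the description.

```python
# Illustrative subset of namespaces which need no extension schema description.
PREDEFINED_NAMESPACES = {
    "http://purl.org/dc/elements/1.1/",  # Dublin Core
    "http://www.aiim.org/pdfa/ns/id/",   # PDF/A identification
}


def undescribed_properties(used, described):
    """Return the custom properties which lack a description.

    used:      iterable of (namespace_uri, property_name) pairs found in the XMP
    described: dict mapping namespace_uri -> set of described property names
    """
    missing = []
    for namespace, name in used:
        if namespace in PREDEFINED_NAMESPACES:
            continue  # predefined schemas are known to every processor
        if name not in described.get(namespace, set()):
            missing.append((namespace, name))
    return missing


# Hypothetical company schema with one described and one undescribed property.
ACME = "http://ns.example.com/acme/1.0/"
used = [
    ("http://purl.org/dc/elements/1.1/", "title"),
    (ACME, "invoiceNumber"),
    (ACME, "costCenter"),
]
described = {ACME: {"invoiceNumber"}}
print(undescribed_properties(used, described))
```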
Metadata in PDF/A-4
The convoluted concept of XMP extension schemas introduced with PDF/A-1 didn’t really catch on with developers and users. The industry struggled for several years to work out the details of extension schema processing which were missing from the standard text. This led to frustration: on the one hand it was hard to add custom metadata properties to PDF/A correctly, and on the other hand applications which didn’t use custom properties nevertheless triggered XMP-related errors in PDF/A validators. PDF/A-4 eliminates these problems in a radical way by getting rid of XMP extension schema descriptions completely. They are replaced with a machine-readable schema description according to the RELAX NG standard, published in 2014 as ISO 16684-2. However, unlike the required extension schemas in PDF/A-1/2/3, schema descriptions are optional in PDF/A-4.
Another source of problems was the requirement to synchronize XMP metadata with entries in the document information dictionary. This so-called crosswalk was underspecified and even got some details wrong in the first published version of PDF/A-1. Since PDF 2.0, the basis of PDF/A-4, almost completely deprecates document info entries, PDF/A-4 no longer requires metadata synchronization.
PDF/A-1/2/3 Level A conformance: Tagged PDF
PDF/A-1a, PDF/A-2a and PDF/A-3a require the use of Tagged PDF. While plain PDF only places visible contents on a page, Tagged PDF requires that the document’s logical structure is recorded within the structure hierarchy. Tagged PDF offers predefined structure element types for common parts of a document such as headings, tables and lists. So-called marked content items can be considered the equivalent of tagged content in markup languages. They refer to elements in this structure tree. Similar to HTML and XML, Tagged PDF supports attributes for structure elements. For example, table elements can carry attributes regarding the row or column spanning properties of table cells.
Level A conformance also requires that all text in the document has Unicode semantics available (see below) and that logical words are separated by space characters.
PDF/UA-1 (Universal Accessibility) clarifies many aspects of Tagged PDF. It was published in 2012 as ISO 14289-1. Although there is no direct relationship between the two standards, a PDF/A document can at the same time conform to PDF/UA. In fact, if you want to create PDF/A-1/2/3 with conformance level A we recommend adhering to the PDF/UA requirements in order to improve accessibility.
PDF/A-4 abandons level A conformance and simply mentions the advantages of Tagged PDF for content recovery. The standard references PDF/UA for further guidance, i.e. the recommendation above is now included in the standard.
PDF/A-2/3 Level U conformance: Unicode requirements
PDF/A-2 and PDF/A-3 offer level U conformance in addition to levels A and B. Level U requires proper Unicode semantics for all text in the document, but does not mandate Tagged PDF. This requirement is rooted in the fact that PDF supports a variety of font and encoding techniques, not all of which support Unicode. For example, PDF supports PostScript Type 1 fonts, a format which is deprecated or no longer supported in many current operating systems and applications. This format was introduced in the 1980s, while the Unicode Consortium started its work in 1991. PDF/A conformance levels A and U require that supplementary Unicode mapping information is present for fonts which do not contain it internally. But not all Unicode values are acceptable: values in the Private Use Area (PUA) are not allowed since they don't carry any common interpretation.
Symbolic fonts are an important area where this PDF/A requirement holds, e.g. fonts containing logos or pictograms. Since standardized Unicode values are not available for custom symbolic glyphs, suitable Unicode semantics must be provided in an ActualText marked content attribute for the text. While this attribute is commonly used only in Tagged PDF, it can also be supplied in untagged documents, which is what level U conformance requires. The ActualText may be assigned to an individual glyph or a sequence of multiple glyphs.
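The Private Use Area restriction can be checked mechanically, e.g. for a candidate ActualText string or for the values in a font's Unicode mapping. The sketch below relies on Python's unicodedata module, which reports the general category »Co« for private-use code points; the function name is hypothetical and the check covers only this single aspect of the Unicode requirements.

```python
import unicodedata


def private_use_chars(text):
    """Return all characters in `text` which lie in a Private Use Area."""
    # General category "Co" (Other, private use) covers U+E000..U+F8FF
    # as well as the private-use ranges in planes 15 and 16.
    return [ch for ch in text if unicodedata.category(ch) == "Co"]


# A logo glyph mapped to U+E000 carries no common interpretation ...
print(private_use_chars("ACME \ue000"))
# ... while a descriptive replacement text passes the check.
print(private_use_chars("ACME logo"))
```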
PDF/A-4 eliminates level U conformance, but recommends level U Unicode properties for all documents. However, this is not a strict requirement.
Annotations and PDF/A-4 Level E conformance
PDF supports a variety of annotation types (also called comments) which enrich documents. Some annotation types are prohibited in PDF/A; allowed annotations must adhere to several rules.
In PDF/A-1 Sound and Movie annotations are not permitted since »support for multimedia content is outside the scope« of the standard. In the same spirit PDF/A-2 and PDF/A-3 disallow the newer 3D and Screen annotation types. PDF/A-4 prohibits Sound, Screen and Movie annotations.
In addition, PDF/A-4 introduces conformance level E. It can be considered the successor of PDF/E, the standard for PDF in engineering, which didn’t find widespread adoption. PDF/A-4e allows 3D and RichMedia annotations in support of interactive applications. For 3D data the standard recommends RichMedia annotations instead of 3D annotations.
Another new condition in PDF/A-4 which stems from PDF 2.0 is the requirement to have annotation appearances included in the document. These describe the graphical representation of an annotation. An annotation dictionary contains individual graphical properties such as border style, color, font etc. for its graphical representation. Optionally, a complete description of the annotation's graphical appearance (called appearance stream) may be present. In this case the annotation is displayed identically in all PDF viewers. However, if the annotation appearance is absent the viewer must reconstruct it from the graphical properties. Since this process is not standardized the visual result varies among PDF viewers.
In order to avoid such display differences PDF/A-4 requires the presence of annotation appearance streams for all annotation types except Popup and Link.
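The annotation rules described in this section can be summarized in a small lookup, shown here as a sketch. The function and table names are hypothetical, and two readings of the text are assumptions: that PDF/A-2 and PDF/A-3 keep the Sound and Movie prohibition of PDF/A-1, and that 3D and RichMedia annotations in PDF/A-4 are restricted to level E. The appearance stream check is modeled for PDF/A-4 only, following the description above.

```python
# Annotation subtypes prohibited per PDF/A part, following the text above.
# Assumption: PDF/A-2/3 carry over the Sound/Movie prohibition of PDF/A-1.
PROHIBITED = {
    1: {"Sound", "Movie"},
    2: {"Sound", "Movie", "3D", "Screen"},
    3: {"Sound", "Movie", "3D", "Screen"},
    4: {"Sound", "Movie", "Screen"},
}
LEVEL_E_ONLY = {"3D", "RichMedia"}      # assumption: PDF/A-4e only
NO_APPEARANCE_NEEDED = {"Popup", "Link"}


def check_annotation(subtype, has_appearance, part, level=None):
    """Return a list of problems for one annotation (empty list: no findings)."""
    problems = []
    if subtype in PROHIBITED[part]:
        problems.append(f"{subtype} annotations are prohibited in PDF/A-{part}")
    if part == 4:
        if subtype in LEVEL_E_ONLY and (level or "").lower() != "e":
            problems.append(f"{subtype} annotations require PDF/A-4e")
        if not has_appearance and subtype not in NO_APPEARANCE_NEEDED:
            problems.append(f"{subtype} annotation lacks an appearance stream")
    return problems


print(check_annotation("Text", has_appearance=False, part=4))
print(check_annotation("3D", has_appearance=True, part=4, level="e"))
```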
File Attachments and PDF/A-4 Level F conformance
Attachments can be embedded in a PDF document on the document level or on a page with the help of FileAttachment annotations. Rules for embedded files differ substantially among PDF/A parts:
- PDF/A-1 completely prohibits attachments.
- PDF/A-2 allows attachments, but the embedded documents must conform to PDF/A-1 or PDF/A-2.
- PDF/A-3 allows attachments with arbitrary content types.
- PDF/A-4 allows attachments which conform to PDF/A-1, PDF/A-2 or PDF/A-4. It also introduces a dedicated conformance level F which allows arbitrary content types.
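The rules in this list translate directly into a decision function. The following sketch encodes exactly the four cases above; the function name and the way the attachment is described (its PDF/A part number, or None for any other content) are assumptions made for the example.

```python
def attachment_allowed(part, level, attachment_part):
    """Decide whether an embedded file is acceptable in a PDF/A container.

    part:            PDF/A part of the container document (1, 2, 3 or 4)
    level:           conformance level of the container, e.g. "b" or "f" (may be None)
    attachment_part: PDF/A part the attachment conforms to, or None if the
                     attachment is not a PDF/A document
    """
    if part == 1:
        return False                        # no attachments at all
    if part == 2:
        return attachment_part in {1, 2}    # only PDF/A-1 or PDF/A-2
    if part == 3:
        return True                         # arbitrary content types
    if part == 4:
        if (level or "").lower() == "f":
            return True                     # level F: arbitrary content types
        return attachment_part in {1, 2, 4}
    raise ValueError(f"unknown PDF/A part: {part}")


# A spreadsheet (not PDF/A) embedded in different container flavors:
for part, level in [(2, "b"), (3, "b"), (4, None), (4, "f")]:
    print(part, level, attachment_allowed(part, level, None))
```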
Digital signatures
Digital signatures in PDF documents can be used to check the document’s integrity, authenticate the person who created the signature, and determine the date and time of signing. Digital signatures are part of PDF 1.4 and are allowed in PDF/A. Multiple document signatures using PDF’s incremental update feature are also allowed. However, the signatures must meet certain requirements for PDF/A:
- If the signature has a visual appearance (e.g. an image or a textual representation of the signer’s name) this appearance must meet the same PDF/A requirements as other document parts (device-independent color, fonts embedded, etc.).
- PDF/A-2 and PDF/A-3 contain additional requirements regarding technical details of the signature. These standards also recommend including timestamps and certificate revocation information in the signature.
- PDF/A-4 allows one certification signature, one or more approval signatures and one or more timestamp signatures. All signatures must conform to an appropriate PAdES profile.