Tagged PDF Basics

Since PDF/UA is based on Tagged PDF, let’s look at Tagged PDF first. PDF was originally designed to faithfully preserve the graphical appearance of a document, so a plain PDF document doesn’t necessarily contain information about the document’s structure. For example, section headings may be printed in large or bold type, but there is no explicit »heading« marker in plain PDF documents. In Tagged PDF, similar to XML-based markup languages, content can be »marked« and incorporated into a structural document hierarchy. Each relevant content item has a designated place in this hierarchy. Content items which don’t contribute to the actual document content (e.g. page numbers) are marked as Artifact.

The logical structure of a Tagged PDF document is described by a hierarchy of elements, called the structure hierarchy (or logical structure tree, or tag tree). Starting at the root element (often the Document element), the structure hierarchy may have an arbitrary number of levels. On each level an element may contain zero or more items of the following kinds:

  • Other structure elements, e.g. the Document element may contain multiple Article elements, and each Article element may in turn contain multiple P (paragraph) elements.
  • Content items such as marked sequences of text and graphics on the page, XObjects created from imported images or PDF pages, or annotations and form fields. These items represent the content associated with a structure element.
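
The two kinds of children described above can be modeled with a small tree data structure. The following Python sketch is only an illustration: the class names StructElem and ContentItem and the helper outline() are hypothetical and do not belong to any PDF library. It shows one possible way to represent a Document element that contains Article elements, which in turn contain P elements with attached content items.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class ContentItem:
    """A marked piece of page content (hypothetical model, not a PDF library API)."""
    description: str


@dataclass
class StructElem:
    """A structure element such as Document, Article, or P (hypothetical model)."""
    tag: str
    kids: List[Union["StructElem", ContentItem]] = field(default_factory=list)


def outline(elem, depth=0):
    """Return the hierarchy as a list of indented lines, one per element or item."""
    lines = ["  " * depth + elem.tag]
    for kid in elem.kids:
        if isinstance(kid, StructElem):
            # Nested structure element: descend one level.
            lines.extend(outline(kid, depth + 1))
        else:
            # Content item attached to the current structure element.
            lines.append("  " * (depth + 1) + f"[content: {kid.description}]")
    return lines


# Sample hierarchy: one Document with two Article elements.
tree = StructElem("Document", [
    StructElem("Article", [
        StructElem("P", [ContentItem("text run 1")]),
        StructElem("P", [ContentItem("text run 2"), ContentItem("inline image")]),
    ]),
    StructElem("Article", []),  # an element may also have zero children
])

if __name__ == "__main__":
    print("\n".join(outline(tree)))
```

Calling outline(tree) walks the hierarchy from the root and lists each structure element with its content items indented below it.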

Tagged PDF solves the potential conflict between content creation order and logical reading order: the contents of a PDF page can be placed in any order, so text read in this native order does not necessarily match the logical order of the page contents. In contrast, the logical structure tree arranges page contents according to their logical order, i.e. the order in which a human expects to read them.
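
The difference between the two orders can be made concrete with a small Python sketch. It is a simplified model under stated assumptions, not an excerpt from a PDF library: the names page_content, structure_tree and logical_order() are hypothetical, and the integer ids merely stand in for references from structure elements to marked content on the page.

```python
# Page content in creation order: (content id, text). Ids are illustrative only.
page_content = [
    (0, "Side note"),
    (1, "Second paragraph"),
    (2, "Heading"),
    (3, "First paragraph"),
]

# Structure tree as nested (tag, kids) tuples; integer kids refer to content ids.
structure_tree = ("Document", [
    ("H1", [2]),
    ("P", [3]),
    ("P", [1]),
    ("Note", [0]),
])


def logical_order(node):
    """Collect content ids by depth-first traversal of the structure tree."""
    _tag, kids = node
    ids = []
    for kid in kids:
        if isinstance(kid, int):
            ids.append(kid)                   # content item reference
        else:
            ids.extend(logical_order(kid))    # nested structure element
    return ids


text_by_id = dict(page_content)

# Order in which the content was placed on the page.
native = [text for _id, text in page_content]

# Order defined by the structure tree.
logical = [text_by_id[i] for i in logical_order(structure_tree)]

if __name__ == "__main__":
    print("native: ", native)
    print("logical:", logical)
```

In this model the page content is stored once, in creation order, and the reading order is derived purely from the tree traversal, so the page contents themselves never have to be rearranged.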