TET Features

PDFlib TET 5 - Features

The PDFlib Text and Image Extraction Toolkit (TET) is targeted at extracting text and images from PDF documents, but can also be used to retrieve other information from PDF.

PDFlib TET is designed for stand-alone use and does not require any third-party software. It is robust and suitable for multi-threaded server use.

PDFlib TET provides the following features and offers distinct advantages for both text extraction and image extraction.

Accepted PDF Input

TET supports all flavors of PDF input:

  • All PDF versions up to Acrobat DC, including ISO 32000-1 and -2 (PDF 2.0)
  • Protected PDFs which do not require a password for opening or for which a password is available
  • Damaged PDF documents, which TET repairs

All Writing Systems of the World

TET processes PDF documents in all writing systems of the world and implements special processing required for some scripts:

  • Latin, Greek and Cyrillic scripts
  • Arabic and Hebrew including logical reordering of right-to-left and bidirectional text; normalization of Arabic presentation forms
  • Simplified and Traditional Chinese, Japanese, and Korean regardless of encoding; horizontal and vertical text
  • Indic scripts (without glyph reordering)
  • All other languages and scripts supported with Unicode output

Unicode

Since text in PDF is usually not encoded in Unicode, PDFlib TET normalizes the text in a PDF document to Unicode:

  • TET converts all text contents to Unicode, regardless of the encoding method used in the PDF document.
  • Ligatures and other multi-character glyphs are decomposed into a sequence of the corresponding Unicode characters.
  • Glyphs without appropriate Unicode mappings are identified as such, and are mapped to a configurable replacement character in order to avoid misinterpretation.
  • TET implements various workarounds for problems with specific document creation packages, such as InDesign and TeX documents or PDFs generated on mainframe systems.

Content Analysis and Word Detection

TET includes patented content analysis algorithms:

  • Determine word boundaries which are required to retrieve proper words
  • Combine the parts of hyphenated words (dehyphenation)
  • Remove duplicate instances of text, e.g. shadow and artificially bolded text
  • Recombine paragraphs in reading order
  • Correctly order text which is scattered over the page
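
As an illustration of what dehyphenation involves, the following minimal Python sketch merges a word that was split by a trailing hyphen at a line break. It is a naive heuristic written for this example only; it is not TET's patented algorithm, and the function name `dehyphenate` is hypothetical.

```python
def dehyphenate(lines):
    """Join text lines into one string, merging words split by a trailing hyphen.

    Naive rule (an assumption for this sketch): a line ending in
    "<letter>-" followed by a line starting with a lowercase letter
    is treated as one hyphenated word.
    """
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        split_word = (
            len(out) >= 2
            and out.endswith("-")
            and out[-2].isalpha()
            and line[0].islower()
        )
        if split_word:
            out = out[:-1] + line      # drop the hyphen and glue the halves
        elif out:
            out += " " + line          # ordinary line break becomes a space
        else:
            out = line
    return out


print(dehyphenate(["Text extrac-", "tion from PDF docu-", "ments"]))
```

A rule this simple also merges genuine compounds such as "well-" / "known", which is one reason production-quality dehyphenation needs more context than the line ending alone.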

Page Layout, Table and List Detection

The page content is analyzed to determine text columns. Tables are detected, including cells which span multiple rows or columns. This improves the ordering of the extracted text. Table rows and the contents of each table cell can be identified. Bulleted and numbered lists are identified.

Geometry

TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.
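
The idea of restricting extraction to an area of the page can be sketched in a few lines of Python. The sketch below does not use the TET API; the `Word` class, the helper names and the sample coordinates are assumptions made for illustration. Coordinates are in points with the origin at the lower left corner, which is the default PDF coordinate system.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical container for a word and its bounding box in points."""
    text: str
    llx: float
    lly: float
    urx: float
    ury: float


def inside(word, area):
    """True if the word's box lies completely within area = (llx, lly, urx, ury)."""
    llx, lly, urx, ury = area
    return (word.llx >= llx and word.lly >= lly
            and word.urx <= urx and word.ury <= ury)


def words_in_area(words, area):
    """Keep only the words whose boxes lie inside the given area."""
    return [w for w in words if inside(w, area)]


# A4 page (595 x 842 points): keep the body, ignore 60-point header and footer bands
body = (0, 60, 595, 782)
words = [
    Word("Report", 72, 800, 110, 812),               # in the header band
    Word("PDFlib", 111.48, 636.33, 161.14, 654.33),  # in the body
    Word("3", 290, 30, 296, 42),                     # page number in the footer
]
kept = [w.text for w in words_in_area(words, body)]
print(kept)
```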

Text Color

TET analyzes color information in the PDF page description and returns precise color information for each glyph. This can be used, for example, to identify headings or other highlighted text. Optionally, colors in the advanced color spaces Separation and DeviceN can be reported in a simpler alternate color space.

Image Extraction

Images on PDF pages can be extracted as TIFF, JPEG, JBIG2 or JPEG 2000 files. Precise geometric information (position, size, and angles) is reported for each image. Fragmented images are combined into larger images to facilitate repurposing. Since no downsampling or color conversion occurs, images are extracted with the highest possible quality.

Ignore Artifacts in Tagged PDF

In Tagged PDF, especially PDF/UA, irrelevant content may be tagged as Artifact, e.g. headers and footers. TET optionally ignores Artifact text and images.

PDF Analysis with the pCOS Interface

The TET library includes the pCOS interface for querying details about a PDF document, such as document info and XMP metadata, font lists, page size, and much more.

Unicode Postprocessing

TET supports various Unicode postprocessing steps which can be used to improve the extracted text:

  • Foldings preserve, remove or replace characters, e.g. remove punctuation or characters from irrelevant scripts.
  • Decompositions replace a character with an equivalent sequence of one or more other characters, e.g. replace narrow, wide or vertical Japanese characters or Latin superscript variants with their respective standard counterparts.
  • Text can be converted to all Unicode normalization forms, e.g. emit NFC form to meet the requirements for Web text or a database.
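
The three kinds of postprocessing can be illustrated with Python's standard `unicodedata` module. This is a sketch of the underlying Unicode concepts, not of TET's own options: the function names are hypothetical, and NFKC is used here as a stand-in that applies all compatibility decompositions at once, whereas the list above describes decompositions as individual steps.

```python
import unicodedata


def fold_punctuation(text):
    """Folding: drop every character whose general category is punctuation (P*)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )


def decompose_compat(text):
    """Decomposition: map compatibility variants (wide, narrow, superscript)
    to their standard counterparts via NFKC."""
    return unicodedata.normalize("NFKC", text)


def to_nfc(text):
    """Normalization: emit canonical composed form (NFC)."""
    return unicodedata.normalize("NFC", text)


print(fold_punctuation("Hello, world!"))           # punctuation removed
print(decompose_compat("\uff30\uff24\uff26"))      # fullwidth letters P, D, F
print(decompose_compat("x\u00b2"))                 # superscript two
print(to_nfc("e\u0301") == "\u00e9")               # e + combining acute
```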

Document Domains

PDF documents may contain text in places other than the page contents. While most applications deal with the page contents only, in many situations other document domains are relevant as well. TET extracts the text from all of the following document domains:

  • page contents
  • predefined and custom document info entries
  • XMP metadata on document and image level
  • bookmarks
  • file attachments and PDF portfolios (processed recursively)
  • form fields
  • comments (annotations)
  • general PDF properties, such as page count and conformance to standards like PDF/A or PDF/X

XMP Metadata

TET supports XMP metadata in several ways:

  • Using the integrated pCOS interface, XMP metadata for the document, individual pages, images, or other parts of the document can be extracted programmatically.
  • TETML output contains XMP document and image metadata.
  • Images extracted in the TIFF or JPEG formats contain image metadata if present in the PDF.

TETML represents PDF Contents as XML

TET optionally represents the PDF contents in an XML flavor called TETML. It contains a variety of PDF information in a form which can be processed with common XML tools. TETML contains the text plus optionally font and position information, resource details (fonts, images, colorspaces), and metadata.

TETML also includes interactive elements such as form fields, annotations, bookmarks etc. It can even be used to analyze JavaScript or color space details, ICC profiles or output intents.

TETML can be processed with XSLT stylesheets, e.g. to apply filters or to convert TETML to other formats. Sample XSLT stylesheets for processing TETML are included in the TET distribution.

The following fragment shows TETML output with glyph details:

<Word>
  <Text>PDFlib</Text>
  <Box llx="111.48" lly="636.33" urx="161.14" ury="654.33">
    <Glyph font="F1" size="18" x="111.48" y="636.33" width="9.65">P</Glyph>
    <Glyph font="F1" size="18" x="121.12" y="636.33" width="11.88">D</Glyph>
    <Glyph font="F1" size="18" x="133.00" y="636.33" width="8.33">F</Glyph>
    <Glyph font="F1" size="18" x="141.33" y="636.33" width="4.88">l</Glyph>
    <Glyph font="F1" size="18" x="146.21" y="636.33" width="4.88">i</Glyph>
    <Glyph font="F1" size="18" x="151.08" y="636.33" width="10.06">b</Glyph>
  </Box>
</Word>

TETML can include information about word and paragraph grouping as well as about tables and lists, image placement and annotations along with geometric information for these elements.
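
Because TETML is plain XML, a fragment like the Word element shown above can be read with any XML parser. The following Python sketch uses the standard `xml.etree.ElementTree` module to pull out the word text, its bounding box and the glyph positions. The function name `parse_word` is hypothetical, and the sketch assumes the fragment is parsed on its own without an XML namespace; complete TETML files may declare a namespace, in which case the element lookups would need to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

# The Word element from the TETML fragment above
FRAGMENT = """
<Word>
  <Text>PDFlib</Text>
  <Box llx="111.48" lly="636.33" urx="161.14" ury="654.33">
    <Glyph font="F1" size="18" x="111.48" y="636.33" width="9.65">P</Glyph>
    <Glyph font="F1" size="18" x="121.12" y="636.33" width="11.88">D</Glyph>
    <Glyph font="F1" size="18" x="133.00" y="636.33" width="8.33">F</Glyph>
    <Glyph font="F1" size="18" x="141.33" y="636.33" width="4.88">l</Glyph>
    <Glyph font="F1" size="18" x="146.21" y="636.33" width="4.88">i</Glyph>
    <Glyph font="F1" size="18" x="151.08" y="636.33" width="10.06">b</Glyph>
  </Box>
</Word>
"""


def parse_word(xml_text):
    """Return (text, bbox, glyphs) for a single namespace-free Word element.

    bbox is (llx, lly, urx, ury); glyphs is a list of (char, x, width).
    """
    word = ET.fromstring(xml_text.strip())
    text = word.findtext("Text")
    box = word.find("Box")
    bbox = tuple(float(box.get(k)) for k in ("llx", "lly", "urx", "ury"))
    glyphs = [
        (g.text, float(g.get("x")), float(g.get("width")))
        for g in box.iter("Glyph")
    ]
    return text, bbox, glyphs


text, bbox, glyphs = parse_word(FRAGMENT)
print(text, bbox)
for char, x, width in glyphs:
    print(char, x, width)
```

In this fragment the right edge of the box coincides with the x position of the last glyph plus its width, which is a quick sanity check when working with the glyph geometry.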

TET Connectors

TET connectors interface TET with other software. They make PDF text extraction functionality available for various environments:

  • TET connector for the Lucene Search Engine
  • TET connector for the Solr Search Server
  • TET connector for the TIKA toolkit
  • TET connector for Oracle Text
  • TET connector for MediaWiki
  • TET PDF IFilter for Microsoft products is available as a separate product. It extracts text and metadata from PDF documents and makes them available to search and retrieval software on Windows.

TET Cookbook

The TET Cookbook is a collection of programming examples which demonstrate the use of TET for various text and image extraction tasks. Several Cookbook samples show how to combine the TET and PDFlib+PDI products in order to enhance PDF documents, e.g. add bookmarks or links based on the text on the page.