New in TET

PDFlib TET 5 - New Features

The first version of PDFlib TET was published in 2002. Since that initial release, TET has solved the PDF content extraction problems of thousands of customers around the world. With the major release TET 5 we have further improved our solid extraction tool. In addition to many PDF processing improvements, it brings significant functional enhancements, mainly in the areas of image extraction, color retrieval and TETML contents.

What's new in PDFlib TET 5.5?

The features below are new or considerably improved in TET 5.5:

  • security and performance updates of third-party components
  • enhancements in all language bindings and updates for the latest language versions including .NET 8, PHP 8.3, Perl 5.38 and Ruby 3.2
  • several minor bug fixes and improvements

What's new in PDFlib TET 5.4?

The features below are new or considerably improved in TET 5.4:

  • security and performance updates of third-party components
  • enhancements in all language bindings and updates for the latest language versions including .NET 6/7, PHP 8.1/8.2, Perl 5.34/5.36 and Ruby 3.1
  • support for ARM64/x86_64 bindings on macOS
  • improved Apache Tika and MediaWiki connectors
  • many minor bug fixes and improvements

What's new in PDFlib TET 5.3?

The features below are new or considerably improved in TET 5.3:

  • optimized PDF resource handling to improve performance for documents with excessive numbers of images, patterns or other resources
  • security and performance updates of all third-party components
  • hardened processing of damaged and illegal PDF documents through testing with the full »Issue Tracker« PDF corpus, which contains tens of thousands of »stressful PDF files«
  • expanded platform and CPU support including macOS on ARM64 and Linux on ARM64
  • a timeout can be specified to limit processing time for large or complex files
  • enhancements in all language bindings and updates for the latest language versions including .NET 5, PHP 8, Perl 5.32 and Ruby 3.0
  • support for native UTF-8, UTF-16 and UTF-32 Unicode strings in C++17 and C++20
  • detection of certain kinds of attacks that use legal PDF constructs to build overly large data structures
  • improved TETML output for edge cases
  • improved word boundary, list and paragraph detection
  • support for Unicode 13
  • improved performance of the Classic .NET binding
  • many minor bug fixes and improvements
  • updated character collections and CMaps for PDF 2.0

What's new in PDFlib TET 5.2?

The features below are new or considerably improved in TET 5.2:

  • improved table detection with row and column span identification
  • mark Artifacts (irrelevant text and images) in TETML and the API
  • extract text and images from annotations and patterns
  • support for inline images and images in soft masks (graphics state with a Transparency Group XObject)
  • new language binding for .NET Core
  • enhancements in all language bindings and updates for the latest language versions
  • many bug fixes, improvements and workarounds for damaged PDF documents
  • security updates for third-party libraries
  • optionally retrieve Separation and DeviceN text colors in the simpler alternate color space instead of the rather complex native color space
  • minor extensions of the pCOS interface

What's new in PDFlib TET 5.1?

The features below are new or have been considerably improved in TET 5.1:

  • numbered and unnumbered lists are identified and expressed in TETML
  • repair mode for damaged input documents with cross-reference streams
  • improved workarounds for non-conforming input documents
  • improved performance for disabled image, color, and vector engines as well as for documents without layers
  • reduced memory requirements
  • other bug fixes
  • updated language bindings

What's new in PDFlib TET 5.0?

The features below are new or have been considerably improved in TET 5.0.

Text retrieval:

  • retrieve fill and stroke color of text
  • improved layout detection
  • honor vector graphics to improve page and table layout recognition
  • support vertical font metrics for CJK text

Image retrieval:

  • significantly enhanced merging of fragmented images, e.g. for rotated images
  • improved image handling for many special cases and rare PDF image flavors
  • extract image masks and soft masks
  • merge and convert JPEG 2000-compressed images
  • preserve spot color in extracted TIFF images
  • restrict image extraction to user-selected area
  • collect XMP image metadata stored in non-standard locations by InDesign

Page processing:

  • optionally ignore artifacts (irrelevant content) in Tagged PDF
  • honor layers (optional content) to avoid extraction of invisible content
  • honor clipping paths to avoid extraction of invisible content
  • check whether an area on the page is empty or contains any text, image, or vector graphics

TETML:

  • TETML includes fill and stroke color of glyphs
  • TETML includes information about interactive elements including annotations, form fields, bookmarks, actions, JavaScript, signatures, etc.
  • TETML includes color space and ICC profile details
  • TETML includes information about layers and page labels

pCOS PDF information retrieval:

  • pCOS pseudo objects for ICC profile details and image masking properties
  • pCOS pseudo objects for form fields

Other areas:

  • additional checks and heuristics for damaged and non-conforming PDF input
  • updated TET language bindings, programming samples, and TET connectors
  • new options for improved PDF processing control
  • many improvements in existing TET features