PDFlib TET 5 - New Features

The first version of PDFlib TET was published in 2002. Since that initial release, TET has solved the PDF content extraction problems of thousands of customers around the world. With the major release TET 5, we have further improved our solid extraction tool. In addition to many PDF processing improvements, it brings significant functional enhancements, mainly in the areas of image extraction, color retrieval, and TETML contents.

What's new in PDFlib TET 5.5?

The features below are new or considerably improved in TET 5.5:

security and performance updates of third-party components
enhancements in all language bindings and updates for the latest language versions including .NET 8, PHP 8.3, Perl 5.38 and Ruby 3.2
several minor bug fixes and improvements

What's new in PDFlib TET 5.4?

The features below are new or considerably improved in TET 5.4:

security and performance updates of third-party components
enhancements in all language bindings and updates for the latest language versions including .NET 6/7, PHP 8.1/8.2, Perl 5.34/5.36 and Ruby 3.1
support for ARM64/x86_64 bindings on macOS
improved TIKA and MediaWiki connectors
many minor bug fixes and improvements

What's new in PDFlib TET 5.3?

The features below are new or considerably improved in TET 5.3:

optimized PDF resource handling to improve performance for documents with excessive numbers of images, patterns or other resources
security and performance updates of all third-party components
hardened processing of damaged and illegal PDF documents by testing the full »Issue Tracker« PDF corpus with tens of thousands of »stressful PDF files«
expanded platform and CPU support including macOS on ARM64 and Linux on ARM64
a timeout can be specified to limit processing time for large or complex files
enhancements in all language bindings and updates for the latest language versions including .NET 5, PHP 8, Perl 5.32 and Ruby 3.0
support for native UTF-8, UTF-16 and UTF-32 Unicode strings in C++17 and C++20
detection of certain kinds of attacks that use legal PDF constructs to build overly large data structures
improved TETML output for edge cases
improved word boundary, list and paragraph detection
support for Unicode 13
improved performance of the Classic .NET binding
many minor bug fixes and improvements
updated character collections and CMaps for PDF 2.0

What's new in PDFlib TET 5.2?

The features below are new or considerably improved in TET 5.2:

improved table detection with row and column span identification
mark Artifacts (irrelevant text and images) in TETML and the API
extract text and images from annotations and patterns
support for inline images and images in soft masks (graphics state with a Transparency Group XObject)
new language binding for .NET Core
enhancements in all language bindings and updates for the latest language versions
many bug fixes, improvements and workarounds for damaged PDF documents
security updates for third-party libraries
optionally retrieve Separation and DeviceN text colors in the simpler alternate color space instead of the rather complex native color space
minor extensions of the pCOS interface

What's new in PDFlib TET 5.1?

The features below are new or have been considerably improved in TET 5.1:

numbered and unnumbered lists are identified and expressed in TETML
repair mode for damaged input documents with cross-reference streams
improved workarounds for non-conforming input documents
improved performance when the image, color, or vector engines are disabled, as well as for documents without layers
reduced memory requirements
other bug fixes
updated language bindings

What's new in PDFlib TET 5.0?

The features below are new or have been considerably improved in TET 5.0:

Text retrieval:

retrieve fill and stroke color of text
improved layout detection
honor vector graphics to improve page and table layout recognition
support vertical font metrics for CJK text

Image retrieval:

significantly enhanced merging of fragmented images, e.g. for rotated images
improved image handling for many special cases and rare PDF image flavors
extract image masks and soft masks
merge and convert JPEG 2000-compressed images
preserve spot color in extracted TIFF images
restrict image extraction to user-selected area
collect XMP image metadata stored in non-standard locations by InDesign

Page processing:

optionally ignore artifacts (irrelevant content) in Tagged PDF
honor layers (optional content) to avoid extraction of invisible content
honor clipping paths to avoid extraction of invisible content
check whether an area on the page is empty or contains any text, image, or vector graphics

TETML:

TETML includes fill and stroke color of glyphs
TETML includes information about interactive elements including annotations, form fields, bookmarks, actions, JavaScript, signatures, etc.
TETML includes color space and ICC profile details
TETML includes information about layers and page labels

pCOS PDF information retrieval:

pCOS pseudo objects for ICC profile details and image masking properties
pCOS pseudo objects for form fields

Other areas:

additional checks and heuristics for damaged and non-conforming PDF input
updated TET language bindings, programming samples, and TET connectors
new options for improved PDF processing control
many improvements in existing TET features
