Free TET Plugin



The TET Plugin provides easy access to the PDFlib Text Extraction Toolkit (TET). Although the TET Plugin runs as an Acrobat plugin, the underlying text extraction does not use Acrobat functions; it is based entirely on TET. The TET Plugin is provided as a technology study to demonstrate the power of PDFlib TET. Because TET is more powerful than Acrobat’s built-in text extractor, and the TET Plugin adds a number of convenient user interface features, it is useful as a replacement for Acrobat’s built-in copy and find features. PDFlib TET can successfully process documents for which Acrobat produces only garbled output when extracting text. The TET Plugin provides the following functions:

Copy the text from a PDF document as plain text, RTF, or XML to the system clipboard or to a file on disk. Enhanced clipboard controls make copy and paste easier.

Copy bookmarks from a PDF document.

Copy XMP document metadata.

Find words in the document.

Detailed configuration settings are available to adjust text extraction to your requirements. Configuration sets can be saved and reloaded.


Advantages over Acrobat’s copy function

The TET Plugin offers several advantages over Acrobat’s built-in copy facility:

The output can be customized to match different application requirements.

TET correctly interprets the text in many cases where Acrobat copies only garbled text to the clipboard.

Unknown glyphs (those for which no proper Unicode mapping can be established) are highlighted in red and can be replaced with a user-selected character (e.g. a question mark).

TET processes documents much faster than Acrobat.
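
The replacement of unknown glyphs mentioned above can be pictured with a small sketch. This is not the TET Plugin's code or API. It assumes, purely for illustration, that glyphs without a Unicode mapping arrive in the extracted text as the Unicode replacement character U+FFFD; the function name `replace_unknown_glyphs` is hypothetical.

```python
# Illustrative sketch only; this is not the TET Plugin's code or API.
# Assumption: glyphs without a Unicode mapping arrive as U+FFFD.
UNKNOWN_GLYPH = "\ufffd"


def replace_unknown_glyphs(text: str, replacement: str = "?") -> tuple[str, int]:
    """Return the text with unknown glyphs replaced, plus how many were found."""
    count = text.count(UNKNOWN_GLYPH)
    return text.replace(UNKNOWN_GLYPH, replacement), count


if __name__ == "__main__":
    # Hypothetical extracted text containing one unmapped glyph.
    extracted = "Invoice total: 42\ufffd00 EUR"
    cleaned, count = replace_unknown_glyphs(extracted, "?")
    print(cleaned, count)
```

Returning the count alongside the cleaned text lets a caller decide whether the extraction is reliable enough to use, which is roughly the role the red highlighting plays in the user interface.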


What is PDFlib TET?

The PDFlib Text Extraction Toolkit (TET) is the underlying engine of the TET Plugin. TET is a developer product for reliably extracting text from PDF documents. It delivers the text contents of a PDF as Unicode strings, together with detailed glyph and font information and the position of the text on the page. In addition, TET contains advanced content analysis algorithms for determining word boundaries, grouping text into columns, and removing redundant text such as shadows or artificially bolded text. Using the auxiliary pCOS interface, you can retrieve arbitrary objects from the PDF, such as metadata and interactive elements. With PDFlib TET you can:

Implement a search engine for processing PDF documents;

Extract text from PDFs, e.g. to store it in a database;

Convert text contents of PDFs to other formats, such as XML;

Process PDFs based on their contents.

TET is available as a programming library (component) for various development environments, and as a command-line tool for batch operations. Both offer similar features but suit different deployment tasks.
Fully functional evaluation versions of PDFlib TET for a variety of platforms are available here.