PDFlib TET 5 – Text and Image Extraction Toolkit

What is PDFlib TET?

PDFlib TET (Text and Image Extraction Toolkit) reliably extracts text, images and metadata from PDF documents. TET provides the text contents of a PDF as Unicode strings, along with detailed color, glyph and font information and the position of the text on the page. Raster images are extracted in common image formats. Optionally, TET converts PDF documents to TETML, an XML-based format which contains text and metadata as well as resource information.

TET contains advanced content analysis algorithms for determining word boundaries, grouping text into columns and removing redundant text. Using the integrated pCOS interface, you can retrieve arbitrary objects from the PDF, such as metadata and interactive elements.
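
To give a feel for what determining word boundaries involves, the following is a minimal Python sketch of one common heuristic: starting a new word whenever the horizontal gap between neighboring glyphs is large relative to the glyph width. It is not TET's algorithm and does not use the TET API; the Glyph class, the group_into_words function and the gap_ratio threshold are hypothetical names chosen for this illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """Hypothetical glyph record: a character plus its horizontal placement."""
    char: str
    x: float      # left edge of the glyph on the line
    width: float  # horizontal extent of the glyph


def group_into_words(glyphs: List[Glyph], gap_ratio: float = 0.3) -> List[str]:
    """Split the glyphs of one text line into words.

    Assumption for this sketch: a new word starts when the gap to the
    previous glyph exceeds gap_ratio times the previous glyph's width.
    """
    words: List[str] = []
    current = ""
    prev = None
    # Process glyphs left to right, regardless of their order in the input.
    for glyph in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            gap = glyph.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                words.append(current)
                current = ""
        current += glyph.char
        prev = glyph
    if current:
        words.append(current)
    return words
```

A production-grade extractor has to handle much more than this, for example rotated text, varying font sizes and multiple columns, which is why the simple threshold above should be read only as an illustration of the problem.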

With PDFlib TET you can:

Implement the PDF indexer for a search engine

Repurpose text and images in PDFs

Convert the contents of PDFs to other formats

Process PDFs based on their contents, e.g. splitting based on headings (requires PDFlib+PDI in addition to TET)

Check whether an area on the page is empty or contains any text, image, or vector graphics
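
As an illustration of the last item, checking whether an area is empty can be pictured as a rectangle intersection test against the bounding boxes of the page contents. The Python sketch below assumes the boxes have already been obtained by some extraction step; it does not call the TET API, and the names Box, boxes_overlap and area_is_empty are hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical box format: (llx, lly, urx, ury), i.e. lower left and
# upper right corner in page coordinates.
Box = Tuple[float, float, float, float]


def boxes_overlap(a: Box, b: Box) -> bool:
    """Return True if the two rectangles share interior area.

    Rectangles that merely touch along an edge do not count as overlapping.
    """
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def area_is_empty(area: Box, content_boxes: Iterable[Box]) -> bool:
    """Return True if no content box (text, image or graphics) overlaps the area."""
    return not any(boxes_overlap(area, box) for box in content_boxes)
```

For example, with an area of (0, 0, 100, 100), a content box at (150, 150, 200, 200) leaves the area empty, while a box at (50, 50, 150, 150) does not.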

TET Product Family

The TET family comprises the following products:

Text and Image Extraction Toolkit (TET), the core product for extracting text, images, metadata and other elements from PDF.

TET PDF IFilter extracts text and metadata from PDF documents and makes them available to search and retrieval software on Windows. It is available as a separate product and is suitable for use with Microsoft search products, e.g. Windows Search, SharePoint and SQL Server.

TET Plugin for Adobe Acrobat, a free utility for extracting text and images from PDF. It can be used to evaluate TET interactively.