PDFlib

All Topics Overview

Text Extraction

Process the text contents of PDF documents

extractor

Simple text extractor

concordance

Create a list of all unique words in the document.

back_of_the_book_index

Create a sorted list of all words in the document along with the page numbers where the words occur.
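
As a rough illustration of the back_of_the_book_index idea, the following Python sketch builds a sorted word list with page numbers. It deliberately does not call the TET API: it assumes the text of each page has already been extracted into a list of strings, and the names build_index and sample_pages are hypothetical, chosen only for this example. The actual cookbook sample may work differently.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Return a sorted list of (word, [page numbers]) pairs.

    `pages` is assumed to be a list of already-extracted page texts,
    with the first element corresponding to page 1.
    """
    occurrences = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        # Treat runs of letters as words; digits and punctuation are skipped.
        for word in re.findall(r"[^\W\d_]+", text):
            occurrences[word.lower()].add(page_number)
    # Sort words alphabetically and page numbers in ascending order.
    return [(word, sorted(numbers)) for word, numbers in sorted(occurrences.items())]


if __name__ == "__main__":
    # Hypothetical stand-in for text extracted from a two-page document.
    sample_pages = ["PDF text extraction", "Extraction of images from PDF"]
    for word, numbers in build_index(sample_pages):
        print(f"{word}: {', '.join(map(str, numbers))}")
```

Dropping the page numbers from the result gives the list of unique words described under concordance.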

Font Processing

Analyze font information in PDF documents

font_finder

Identify the locations in a PDF where a particular font is used; print the page number, location, and start of text for each hit

font_statistics

Font statistics

Image Extraction

Extract raster images from PDF documents

determine_image_resolution

Find out image resolutions

image_count

Count images in a PDF according to various interpretations

image_resources

Resource-based image extractor based on PDFlib TET

images_in_memory

Simple image reader

images_per_page

PDF image extractor based on PDFlib TET.

TET and PDFlib

Modify or enhance PDF documents with PDFlib+PDI based on their text contents

create_web_links

Enhance PDFs with TET and PDFlib+PDI.

create_bookmarks

Generate bookmarks based on specific page content.

burst

Split a document into smaller parts based on some page contents.

highlight_search_terms

Highlight text on imported pages based on some criteria.

search_and_replace_text

Find text with TET, hide it with a white rectangle, and add the replacement text on top of it.

create_table_of_contents

Automatically create a table of contents based on typographic rules.

Highlight unmapped glyphs

Highlight unmapped glyphs (i.e. glyphs for which TET could not determine a Unicode mapping).

Highlight fonts

Highlight text in certain fonts.

TETML and XSLT

Convert PDF documents to TETML and process TETML with XSLT

tetml

Simple TETML converter

tetml2HTML

Convert TETML to HTML.

solr

Generate input for the Solr enterprise search server.

textonly

Extract raw text from TETML input.

metadata

Extract XMP metadata from TETML.

table

Extract a table to CSV file.

concordance

Create a concordance.

fontfilter

words in a document that use a particular font at a size larger than a specified value

fontfinder

font occurrences with page number and position

fontstat

font and glyph statistics

index

"back-of-the-book" index

Special

Other topics

get_attachments

Extract text and images from attachments.

identify_ocr

Classify the pages in a document according to text or image content.

region_of_interest

Restrict text extraction to a particular area on the page.

multiple_documents

Process multiple documents in a loop.