PDFlib

All Topics Overview

Text Extraction

Process the text contents of PDF documents
text_extractor: Simple text extractor.
concordance: Create a list of all unique words in the document.
back_of_the_book_index: Create a sorted list of all words in the document along with the page numbers where the words occur.
glyphinfo: Print text plus coordinates, font name, font size, and more.
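
The concordance and back_of_the_book_index topics can be illustrated with a rough sketch that is independent of TET. It assumes the text of each page has already been extracted into a list of strings (one string per page); the names `WORD_RE`, `concordance`, and `back_of_the_book_index` below are illustrative only and are not part of the TET API.

```python
import re
from collections import defaultdict

# Assumption: a "word" is a run of letters, optionally joined by ' or -.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def concordance(pages):
    """Return a sorted list of the unique (case-folded) words on all pages."""
    words = set()
    for text in pages:
        words.update(w.casefold() for w in WORD_RE.findall(text))
    return sorted(words)


def back_of_the_book_index(pages):
    """Map each word to the sorted 1-based page numbers where it occurs."""
    index = defaultdict(set)
    for page_no, text in enumerate(pages, start=1):
        for w in WORD_RE.findall(text):
            index[w.casefold()].add(page_no)
    return {word: sorted(nums) for word, nums in sorted(index.items())}


if __name__ == "__main__":
    # Hypothetical page texts standing in for extracted PDF content.
    sample_pages = ["Fonts and glyphs.", "Glyphs map to Unicode."]
    print(concordance(sample_pages))
    for word, page_numbers in back_of_the_book_index(sample_pages).items():
        print(word, page_numbers)
```

Case-folding merges "Glyphs" and "glyphs" into one entry; a real index would probably also filter stop words and very short tokens.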

Font Processing

Analyze font information in PDF documents
font_finder: Identify the locations in a PDF where a particular font is used; print the page number, location, and start of text for each hit.
font_statistics: Font statistics.

Image Extraction

Extract raster images from PDF documents
determine_image_resolution: Find out image resolutions.
image_count: Count images in a PDF according to various interpretations.
image_resources: Resource-based image extractor based on PDFlib TET.
images in memory: Simple image reader.
images_per_page: PDF image extractor based on PDFlib TET.
image_orientation: Determine image orientation and mirroring.

TET and PDFlib

Modify or enhance PDF documents with PDFlib+PDI based on their text contents
create_web_links: Enhance PDFs with TET and PDFlib+PDI.
create_bookmarks: Generate bookmarks based on specific page content.
burst: Split a document into smaller parts based on page contents.
highlight_search_terms: Highlight text on imported pages based on some criteria.
search_and_replace_text: Find text with TET, hide it with a white rectangle, and add the replacement text on top of it.
create_table_of_contents: Automatically create a table of contents based on typographic rules.
highlight unmapped glyphs: Highlight unmapped glyphs (i.e. glyphs for which TET could not determine a Unicode mapping).
highlight fonts: Highlight text in certain fonts.

TETML and XSLT

Convert PDF documents to TETML and process TETML with XSLT
tetml: Simple TETML converter.
tetml2HTML: Convert TETML to HTML.
solr: Generate input for the Solr enterprise search server.
textonly: Extract raw text from TETML input.
metadata: Extract XMP metadata from TETML.
table: Extract a table to a CSV file.
concordance: Create a concordance.
fontfilter: List the words in a document that use a particular font at a size larger than a specified value.
fontfinder: Font occurrences with page number and position.
fontstat: Font and glyph statistics.
index: "Back-of-the-book" index.
colorspace: Create a listing of all colorspaces used in the document.
fields: Create a listing of all fields used in the document.

Special

get_attachments: Extract text and images from attachments.
identify_ocr: Classify the pages in a document according to text or image content.
region_of_interest: Restrict text extraction to a particular area on the page.
multiple_documents: Process multiple documents in a loop.