TET Cookbook

Font Processing

font_finder Identify the locations in a PDF where a particular font is used. Print the page number, location, and start of text for each hit.
font_statistics Font statistics

Image Extraction

determine_image_resolution Find out image resolutions.
image_count Count images in a PDF according to various interpretations.
image_orientation Determine image orientation and mirroring.
image_resources Resource-based image extractor based on PDFlib TET.
images_in_memory Simple image reader
images_per_page PDF image extractor based on PDFlib TET.

Special

emptycheck Check whether a specified area on the page is empty, i.e. does not contain any text, vector graphics, or images.
extract_highlighted_text Extract text under Highlight annotations.
get_attachments Extract the text from the document and recursively from all embedded PDF attachments.
identify_ocr Classify the pages in a document based on the page content.
multiple_documents Generalized form of the simple text extractor for multiple documents.
region_of_interest Restrict text extraction to a particular "region of interest", i.e. some area on the page based on knowledge about the document layout.
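
The idea behind region_of_interest can be sketched without any PDF library: given words that already carry bounding boxes, keep only those inside a rectangle. This is a minimal illustration, not the cookbook's code. The Word class, the words_in_region helper, and the sample coordinates are all hypothetical, and the sketch assumes a coordinate system with the origin at the lower left corner of the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical container for one extracted word and its bounding box."""
    text: str
    x: float       # lower left corner, x
    y: float       # lower left corner, y
    width: float
    height: float


def words_in_region(words, llx, lly, urx, ury):
    """Keep the words whose bounding box lies completely inside the rectangle
    given by its lower left (llx, lly) and upper right (urx, ury) corners."""
    return [
        w for w in words
        if w.x >= llx and w.y >= lly
        and w.x + w.width <= urx and w.y + w.height <= ury
    ]


# Made-up sample data: three words on an A4-sized page (595 x 842 points).
words = [
    Word("Invoice", 50, 780, 60, 12),
    Word("Total", 400, 100, 40, 12),
    Word("Page", 280, 20, 30, 10),
]

# Restrict the result to the top part of the page.
header = words_in_region(words, 0, 700, 595, 842)
print([w.text for w in header])
```

Requiring full containment is one possible design choice; a variant could accept any word whose box merely overlaps the region.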

TET and PDFlib

burst Split a document into smaller parts based on some page contents.
create_bookmarks Use TET and PDFlib to generate bookmarks based on page content.
create_table_of_contents Use TET and PDFlib to create a table of contents (TOC) for the original document.
create_web_links Use TET and PDFlib to create Web links based on the text contents.
highlight_artifacts Use TET and PDFlib to search for text and image Artifacts and make them visible with the "Highlight" annotation.
highlight_fonts Use TET and PDFlib to search for fonts and make them visible with the "Highlight" annotation.
highlight_search_terms Use TET and PDFlib to identify all occurrences of a particular word, and make them visible with the "Highlight" annotation.
highlight_unmapped_glyphs Use TET and PDFlib to find all glyphs for which TET could not determine a Unicode mapping, and make them visible with the "Highlight" annotation.
search_and_replace_text Find text with TET, hide it with a white rectangle, and add the replacement text on top of it.

TETML and XSLT

colorspaces Create a listing of all colorspaces used in the document.
concordance Create a concordance.
fields Create a listing of all fields used in the document.
fontfilter List the words in a document that use a particular font in a size larger than a specified value.
fontfinder Font occurrences with page number and position
fontstat Font and glyph statistics
index "Back-of-the-book" index
metadata Extract XMP metadata from TETML.
solr Generate input for the Solr enterprise search server.
table Extract a table to CSV file.
tetml Extract text from PDF document as XML.
tetml2html Convert TETML to HTML.
textonly Extract raw text from TETML input.

Text Extraction

back_of_the_book_index Create a sorted list of all words in the document along with the page numbers where the words occur.
concordance Create a sorted list of unique words in a document along with counts.
glyphinfo Simple PDF glyph dumper based on PDFlib TET.
text_extractor PDF text extractor based on PDFlib TET.
text_from_annotations Extract text from annotations with PDFlib TET and the pCOS interface.
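
The data structures behind the concordance and back_of_the_book_index entries can be sketched in plain Python, assuming the text of each page has already been extracted into a list of strings. This is an illustration of the technique only, not the cookbook's implementation. The function names, the tokenization rule, and the sample pages are assumptions.

```python
import re
from collections import Counter, defaultdict

# Assumed tokenization rule: runs of letters, optionally joined by ' or -.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def tokenize(text):
    """Split text into lowercase words."""
    return [m.group(0).lower() for m in WORD_RE.finditer(text)]


def concordance(pages):
    """Return (word, count) pairs for all unique words, sorted by word."""
    counts = Counter()
    for text in pages:
        counts.update(tokenize(text))
    return sorted(counts.items())


def back_of_the_book_index(pages):
    """Map each word to the sorted list of 1-based page numbers where it occurs."""
    index = defaultdict(set)
    for pageno, text in enumerate(pages, start=1):
        for word in tokenize(text):
            index[word].add(pageno)
    return {word: sorted(index[word]) for word in sorted(index)}


# Made-up page texts standing in for extracted PDF text.
pages = ["Fonts and images.", "Images and text; more images."]

for word, count in concordance(pages):
    print(f"{word}: {count}")

for word, page_numbers in back_of_the_book_index(pages).items():
    print(f"{word}: {', '.join(map(str, page_numbers))}")
```

A real index would usually also filter out stop words such as "and"; that step is left out here to keep the sketch short.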