TET Cookbook


Font Processing
font_finder	Identify the locations in a PDF where a particular font is used. Print the page number, location, and start of text for each hit.
font_statistics	Font statistics.

Image Extraction
determine_image_resolution	Determine the resolution of images.
image_count	Count images in a PDF according to various interpretations.
image_orientation	Determine image orientation and mirroring.
image_resources	Resource-based image extractor based on PDFlib TET.
images_in_memory	Simple image reader.
images_per_page	PDF image extractor based on PDFlib TET.

Special
emptycheck	Check whether a specified area on the page is empty, i.e. does not contain any text, vector graphics, or images.
extract_highlighted_text	Extract text under Highlight annotations.
get_attachments	Extract the text from the document and recursively from all embedded PDF attachments.
identify_ocr	Classify the pages in a document based on the page content.
multiple_documents	Generalized form of the simple text extractor for multiple documents.
region_of_interest	Restrict text extraction to a particular "region of interest", i.e. some area on the page based on knowledge about the document layout.

TET and PDFlib
burst	Split a document into smaller parts based on some page contents.
create_bookmarks	Use TET and PDFlib to generate bookmarks based on page content.
create_table_of_contents	Use TET and PDFlib to create a table of contents (TOC) for the original document.
create_web_links	Use TET and PDFlib to create Web links based on the text contents.
highlight_artifacts	Use TET and PDFlib to search for text and image artifacts, and make them visible with the "Highlight" annotation.
highlight_fonts	Use TET and PDFlib to search for fonts, and make them visible with the "Highlight" annotation.
highlight_search_terms	Use TET and PDFlib to identify all occurrences of a particular word, and make them visible with the "Highlight" annotation.
highlight_unmapped_glyphs	Use TET and PDFlib to find all glyphs for which TET could not determine a Unicode mapping, and make them visible with the "Highlight" annotation.
search_and_replace_text	Find text with TET, hide it with a white rectangle, and add the replacement text on top of it.

TETML and XSLT
colorspaces	Create a listing of all color spaces used in the document.
concordance	Create a concordance.
fields	Create a listing of all fields used in the document.
fontfilter	List the words in a document that use a particular font at a size larger than a specified value.
fontfinder	List font occurrences with page number and position.
fontstat	Font and glyph statistics.
index	Create a "back-of-the-book" index.
metadata	Extract XMP metadata from TETML.
solr	Generate input for the Solr enterprise search server.
table	Extract a table to a CSV file.
tetml	Extract text from a PDF document as XML.
tetml2html	Convert TETML to HTML.
textonly	Extract raw text from TETML input.

Text Extraction
back_of_the_book_index	Create a sorted list of all words in the document along with the page numbers where the words occur.
concordance	Create a sorted list of unique words in a document along with counts.
glyphinfo	Simple PDF glyph dumper based on PDFlib TET.
text_extractor	PDF text extractor based on PDFlib TET.
text_from_annotations	Extract text from annotations with PDFlib TET and the pCOS interface.
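
The difference between the concordance and back_of_the_book_index entries above can be illustrated with a minimal sketch. It does not use the TET API and is not the cookbook's own code: it assumes the words have already been extracted into one list per page, and the function names and sample data are hypothetical. Lowercasing the words before counting is also an assumption, not something the listing specifies.

```python
from collections import Counter, defaultdict


def concordance(pages):
    """Return (word, count) pairs for all unique words, sorted alphabetically."""
    counts = Counter(word.lower() for words in pages for word in words)
    return sorted(counts.items())


def back_of_the_book_index(pages):
    """Return (word, page numbers) pairs, sorted alphabetically.

    Pages are numbered starting at 1; each page number appears once per word.
    """
    index = defaultdict(set)
    for page_number, words in enumerate(pages, start=1):
        for word in words:
            index[word.lower()].add(page_number)
    return [(word, sorted(numbers)) for word, numbers in sorted(index.items())]


# Hypothetical sample input: one list of already-extracted words per page.
pages = [
    ["PDF", "text", "font"],
    ["text", "image", "PDF", "text"],
]

for word, count in concordance(pages):
    print(f"{word}\t{count}")

for word, numbers in back_of_the_book_index(pages):
    print(f"{word}\t{', '.join(map(str, numbers))}")
```

The concordance answers "how often does each word occur", while the index answers "on which pages does each word occur"; both start from the same per-page word lists.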