TET Cookbook

Font Processing

font_finder: Identify the locations in a PDF where a particular font is used. Print the page number, location, and start of text for each hit.
font_statistics: Font statistics.

Image Extraction

determine_image_resolution: Find out image resolutions.
image_count: Count images in a PDF according to various interpretations.
image_orientation: Determine image orientation and mirroring.
image_resources: Resource-based image extractor based on PDFlib TET.
images_in_memory: Simple image reader.
images_per_page: PDF image extractor based on PDFlib TET.

Special

emptycheck: Check whether a specified area on the page is empty, i.e. does not contain any text, vector graphics, or images.
extract_highlighted_text: Extract text under Highlight annotations.
get_attachments: Extract the text from the document and recursively from all embedded PDF attachments.
identify_ocr: Classify the pages in a document based on the page content.
multiple_documents: Generalized form of the simple text extractor for multiple documents.
region_of_interest: Restrict text extraction to a particular "region of interest", i.e. some area on the page chosen based on knowledge about the document layout.

TET and PDFlib

burst: Split a document into smaller parts based on some page contents.
create_bookmarks: Use TET and PDFlib to generate bookmarks based on page content.
create_table_of_contents: Use TET and PDFlib to create a table of contents (TOC) for the original document.
create_web_links: Use TET and PDFlib to create Web links based on the text contents.
highlight_artifacts: Use TET and PDFlib to search for text and image Artifacts, and make them visible with the "Highlight" annotation.
highlight_fonts: Use TET and PDFlib to search for fonts, and make them visible with the "Highlight" annotation.
highlight_search_terms: Use TET and PDFlib to identify all occurrences of a particular word, and make them visible with the "Highlight" annotation.
highlight_unmapped_glyphs: Use TET and PDFlib to find all glyphs for which TET could not determine a Unicode mapping, and make them visible with the "Highlight" annotation.
search_and_replace_text: Find text with TET, hide it with a white rectangle, and add the replacement text on top of it.

TETML and XSLT

colorspaces: Create a listing of all colorspaces used in the document.
concordance: Create a concordance.
fields: Create a listing of all fields used in the document.
fontfilter: List the words in a document that use a particular font in a size larger than a specified value.
fontfinder: List font occurrences with page number and position.
fontstat: Create font and glyph statistics.
index: Create a "back-of-the-book" index.
metadata: Extract XMP metadata from TETML.
solr: Generate input for the Solr enterprise search server.
table: Extract a table to a CSV file.
tetml: Extract text from a PDF document as XML.
tetml2html: Convert TETML to HTML.
textonly: Extract raw text from TETML input.

Text Extraction

back_of_the_book_index: Create a sorted list of all words in the document along with the page numbers where the words occur.
concordance: Create a sorted list of unique words in a document along with counts.
glyphinfo: Simple PDF glyph dumper based on PDFlib TET.
text_extractor: PDF text extractor based on PDFlib TET.
text_from_annotations: Extract text from annotations with PDFlib TET and the pCOS interface.
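
The concordance and back_of_the_book_index topics both reduce to simple bookkeeping once the page text is available. The following is a minimal sketch of that bookkeeping only, not the Cookbook's own code: it assumes the text of each page has already been extracted (for example with TET) into a list of strings, and it makes no TET API calls. The function names and the word-splitting rule are hypothetical choices made for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumption: a "word" is a run of letters, optionally joined by ' or -.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def words(text):
    """Split text into lowercase words (hypothetical tokenization rule)."""
    return [w.lower() for w in WORD_RE.findall(text)]


def concordance(pages):
    """Return (word, count) pairs, most frequent first, ties alphabetical."""
    counts = Counter()
    for text in pages:
        counts.update(words(text))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def back_of_the_book_index(pages):
    """Map each word to the sorted list of 1-based page numbers it occurs on."""
    index = defaultdict(set)
    for pageno, text in enumerate(pages, start=1):
        for w in words(text):
            index[w].add(pageno)
    return {w: sorted(index[w]) for w in sorted(index)}


if __name__ == "__main__":
    # Stand-in for text extracted page by page from a PDF.
    sample_pages = ["PDF text extraction", "Text and fonts in PDF"]
    for word, count in concordance(sample_pages):
        print(f"{word} {count}")
    for word, page_numbers in back_of_the_book_index(sample_pages).items():
        print(f"{word}: {', '.join(map(str, page_numbers))}")
```

A real index would usually also filter out stop words such as "and" or "in"; that step is left out here to keep the sketch short.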