Font Processing | |
font_finder | Identify the locations in a PDF where a particular font is used. Print the page number, location, and start of text for each hit. |
font_statistics | Create font statistics. |
Image Extraction | |
determine_image_resolution | Find out image resolutions. |
image_count | Count images in a PDF according to various interpretations. |
image_orientation | Determine image orientation and mirroring. |
image_resources | Resource-based image extractor based on PDFlib TET. |
images_in_memory | Simple image reader. |
images_per_page | PDF image extractor based on PDFlib TET. |
Special | |
emptycheck | Check whether a specified area on the page is empty, i.e. does not contain any text, vector graphics, or images. |
extract_highlighted_text | Extract text under Highlight annotations. |
get_attachments | Extract the text from the document and recursively from all embedded PDF attachments. |
identify_ocr | Classify the pages in a document based on the page content. |
multiple_documents | Generalized form of the simple text extractor for multiple documents. |
region_of_interest | Restrict text extraction to a particular "region of interest", i.e. some area on the page based on knowledge about the document layout. |
TET and PDFlib | |
burst | Split a document into smaller parts based on some page contents. |
create_bookmarks | Use TET and PDFlib to generate bookmarks based on page content. |
create_table_of_contents | Use TET and PDFlib to create a table of contents (TOC) for the original document. |
create_web_links | Use TET and PDFlib to create Web links based on the text contents. |
highlight_artifacts | Use TET and PDFlib to search for text and image Artifacts and make them visible with the "Highlight" annotation. |
highlight_fonts | Use TET and PDFlib to search for fonts and make them visible with the "Highlight" annotation. |
highlight_search_terms | Use TET and PDFlib to identify all occurrences of a particular word, and make them visible with the "Highlight" annotation. |
highlight_unmapped_glyphs | Use TET and PDFlib to find all glyphs for which TET could not determine a Unicode mapping, and make them visible with the "Highlight" annotation. |
search_and_replace_text | Find text with TET, hide it with a white rectangle, and add the replacement text on top of it. |
TETML and XSLT | |
colorspaces | Create a listing of all colorspaces used in the document. |
concordance | Create a concordance. |
fields | Create a listing of all fields used in the document. |
fontfilter | List the words in a document that use a particular font in a size larger than a specified value. |
fontfinder | List font occurrences with page number and position. |
fontstat | Create font and glyph statistics. |
index | Create a "back-of-the-book" index. |
metadata | Extract XMP metadata from TETML. |
solr | Generate input for the Solr enterprise search server. |
table | Extract a table to a CSV file. |
tetml | Extract text from a PDF document as XML. |
tetml2html | Convert TETML to HTML. |
textonly | Extract raw text from TETML input. |
Text Extraction | |
back_of_the_book_index | Create a sorted list of all words in the document along with the page numbers where the words occur. |
concordance | Create a sorted list of unique words in a document along with counts. |
glyphinfo | Simple PDF glyph dumper based on PDFlib TET. |
text_extractor | PDF text extractor based on PDFlib TET. |
text_from_annotations | Extract text from annotations with PDFlib TET and the pCOS interface. |
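
The concordance and back_of_the_book_index entries above describe two closely related word lists: unique words with counts, and words with the page numbers where they occur. The sketch below illustrates only that data structure in plain Python. It does not use the TET API; it assumes the text of each page has already been extracted into a list of strings. The function names, the tokenizing regular expression, the lowercasing, and the sort order are all illustrative assumptions, not the behavior of the actual samples.

```python
import re
from collections import Counter, defaultdict

# Assumption: a "word" is a run of letters, optionally joined by ' or -.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def tokenize(text):
    """Split text into lowercase words (illustrative tokenizer)."""
    return [w.lower() for w in WORD_RE.findall(text)]


def concordance(pages):
    """Return (word, count) pairs for all unique words in the pages.

    Sorted by descending count, then alphabetically (assumed order).
    """
    counts = Counter()
    for text in pages:
        counts.update(tokenize(text))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def back_of_the_book_index(pages):
    """Return (word, [page numbers]) pairs, sorted alphabetically.

    Page numbers start at 1 and follow the order of the input list.
    """
    index = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for word in tokenize(text):
            index[word].add(page_number)
    return [(word, sorted(nums)) for word, nums in sorted(index.items())]


if __name__ == "__main__":
    # Hypothetical page texts standing in for extracted PDF content.
    sample_pages = ["The cat sat.", "The dog sat; the cat ran."]
    for word, count in concordance(sample_pages):
        print(f"{word}\t{count}")
    for word, page_numbers in back_of_the_book_index(sample_pages):
        print(f"{word}\t{', '.join(map(str, page_numbers))}")
```

Keeping page texts as a plain list keeps the sketch independent of how the text was obtained, so the same functions would work on output from any extractor.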