Font Processing | |
| font_finder | Identify the locations in a PDF where a particular font is used. Print the page number, location, and start of text for each hit. |
| font_statistics | Font statistics. |
Image Extraction | |
| determine_image_resolution | Find out image resolutions. |
| image_count | Count images in a PDF according to various interpretations. |
| image_orientation | Determine image orientation and mirroring. |
| image_resources | Resource-based image extractor based on PDFlib TET. |
| images_in_memory | Simple image reader. |
| images_per_page | PDF image extractor based on PDFlib TET. |
Special | |
| emptycheck | Check whether a specified area on the page is empty, i.e. does not contain any text, vector graphics, or images. |
| extract_highlighted_text | Extract text under Highlight annotations. |
| get_attachments | Extract the text from the document and recursively from all embedded PDF attachments. |
| identify_ocr | Classify the pages in a document based on the page content. |
| multiple_documents | Generalized form of the simple text extractor for multiple documents. |
| region_of_interest | Restrict text extraction to a particular "region of interest", i.e. some area on the page based on knowledge about the document layout. |
TET and PDFlib | |
| burst | Split a document into smaller parts based on some page contents. |
| create_bookmarks | Use TET and PDFlib to generate bookmarks based on page content. |
| create_table_of_contents | Use TET and PDFlib to create a table of contents (TOC) for the original document. |
| create_web_links | Use TET and PDFlib to create Web links based on the text contents. |
| highlight_artifacts | Use TET and PDFlib to search for text and image Artifacts, and make them visible with the "Highlight" annotation. |
| highlight_fonts | Use TET and PDFlib to search for fonts, and make them visible with the "Highlight" annotation. |
| highlight_search_terms | Use TET and PDFlib to identify all occurrences of a particular word, and make them visible with the "Highlight" annotation. |
| highlight_unmapped_glyphs | Use TET and PDFlib to find all glyphs for which TET could not determine a Unicode mapping, and make them visible with the "Highlight" annotation. |
| search_and_replace_text | Find text with TET, hide it with a white rectangle, and add the replacement text on top of it. |
TETML and XSLT | |
| colorspaces | Create a listing of all colorspaces used in the document. |
| concordance | Create a concordance. |
| fields | Create a listing of all fields used in the document. |
| fontfilter | List the words in a document that use a particular font at a size larger than a specified value. |
| fontfinder | List font occurrences with page number and position. |
| fontstat | Create font and glyph statistics. |
| index | Create a "back-of-the-book" index. |
| metadata | Extract XMP metadata from TETML. |
| solr | Generate input for the Solr enterprise search server. |
| table | Extract a table to a CSV file. |
| tetml | Extract text from a PDF document as XML. |
| tetml2html | Convert TETML to HTML. |
| textonly | Extract raw text from TETML input. |
Text Extraction | |
| back_of_the_book_index | Create a sorted list of all words in the document along with the page numbers where the words occur. |
| concordance | Create a sorted list of unique words in a document along with counts. |
| glyphinfo | Simple PDF glyph dumper based on PDFlib TET. |
| text_extractor | PDF text extractor based on PDFlib TET. |
| text_from_annotations | Extract text from annotations with PDFlib TET and the pCOS interface. |
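
The concordance and back_of_the_book_index entries above both describe simple word-list algorithms. The following is a minimal sketch of those two ideas in Python, not the cookbook's own code: it assumes the text of each page has already been extracted (for example with TET) into a list of strings, and it does not call the TET API at all. The function names and the word-splitting rule are hypothetical choices made for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumption: a "word" is a run of letters, optionally joined by ' or -.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def words(text):
    """Split text into lowercase words."""
    return [w.lower() for w in WORD_RE.findall(text)]


def concordance(pages):
    """Return (word, count) pairs, sorted alphabetically by word."""
    counts = Counter()
    for text in pages:
        counts.update(words(text))
    return sorted(counts.items())


def back_of_the_book_index(pages):
    """Return (word, [page numbers]) pairs, sorted by word; pages are 1-based."""
    index = defaultdict(set)
    for pageno, text in enumerate(pages, start=1):
        for w in words(text):
            index[w].add(pageno)
    return [(w, sorted(nums)) for w, nums in sorted(index.items())]


if __name__ == "__main__":
    # Hypothetical page texts standing in for extracted PDF content.
    pages = ["Fonts and images", "Images and text"]
    for word, count in concordance(pages):
        print(word, count)
    for word, nums in back_of_the_book_index(pages):
        print(word, ", ".join(str(n) for n in nums))
```

Both functions take the same per-page input, so a single extraction pass could feed either listing. A real index would usually also filter out stop words such as "and", which this sketch leaves in.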