Text Extraction | |
Process the text contents of PDF documents | |
Simple text extractor | |
Create a list of all unique words in the document. | |
Create a sorted list of all words in the document along with the page numbers where the words occur. | |
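The word-list topics above can be illustrated without any extraction library. This is only a minimal sketch: it assumes the text of each page has already been extracted into a list of strings, and `build_word_index` is a hypothetical helper name, not part of TET. Splitting words with the regular expression `\w+` is also an assumption; a real extractor would normally deliver word boundaries itself.

```python
import re
from collections import defaultdict


def build_word_index(pages):
    """Map each lowercased word to the sorted 1-based page numbers where it occurs.

    `pages` is assumed to be a list of strings, one per page.
    """
    index = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(page_number)
    # Sort the words alphabetically and the page numbers numerically.
    return {word: sorted(numbers) for word, numbers in sorted(index.items())}


if __name__ == "__main__":
    sample_pages = ["Fonts and glyphs", "Glyphs map to Unicode"]
    for word, numbers in build_word_index(sample_pages).items():
        print(word, numbers)
```

The keys of the returned dictionary double as the list of unique words, so both topics are covered by the same data structure.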
Font Processing | |
Analyze font information in PDF documents | |
Identify the locations in a PDF where a particular font is used; print the page number, location, and start of text for each hit | |
Font statistics | |
Image Extraction | |
Extract raster images from PDF documents | |
Find out image resolutions | |
Count images in a PDF according to various interpretations | |
Resource-based image extractor based on PDFlib TET | |
Simple image reader | |
PDF image extractor based on PDFlib TET. | |
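For the image resolution topic above, the underlying arithmetic is short: one PDF point is 1/72 inch, so the effective resolution is the pixel count divided by the placed size in inches. The sketch below assumes the pixel dimensions and the placed size in points are already known; `image_dpi` is a hypothetical helper, not a TET function.

```python
def image_dpi(pixel_width, pixel_height, placed_width_pt, placed_height_pt):
    """Return (horizontal_dpi, vertical_dpi) for an image placed on a page.

    The placed size is given in PDF points (1 point = 1/72 inch).
    """
    if placed_width_pt <= 0 or placed_height_pt <= 0:
        raise ValueError("placed size must be positive")
    return (pixel_width * 72.0 / placed_width_pt,
            pixel_height * 72.0 / placed_height_pt)


if __name__ == "__main__":
    # A 600 x 300 pixel image placed at 144 x 72 points (2 x 1 inches).
    print(image_dpi(600, 300, 144, 72))
```

The same image can have different resolutions on different pages, because the result depends on the placed size rather than on the pixel data alone.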
TET and PDFlib | |
Modify or enhance PDF documents with PDFlib+PDI based on their text contents | |
Enhance PDFs with TET and PDFlib+PDI. | |
Generate bookmarks based on specific page content. | |
Split a document into smaller parts based on some page contents. | |
Highlight text on imported pages based on some criteria. | |
Find text with TET, hide it with a white rectangle, and add the replacement text on top of it. | |
Automatically create a table of contents based on typographic rules. | |
Highlight unmapped glyphs (i.e. glyphs for which TET could not determine a Unicode mapping). | |
Highlight text in certain fonts. | |
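The splitting topic above comes down to deciding where each part begins. The following sketch shows only that decision logic, under the assumption that the text of every page is already available as a string. `split_ranges` and the chapter pattern are hypothetical; writing the actual output documents with PDFlib+PDI is not shown.

```python
import re


def split_ranges(pages, pattern):
    """Return (first_page, last_page) ranges, 1-based and inclusive.

    A new part starts on every page whose text matches `pattern`.
    Pages before the first match form a part of their own.
    """
    if not pages:
        return []
    starts = [number for number, text in enumerate(pages, start=1)
              if re.search(pattern, text)]
    if not starts or starts[0] != 1:
        starts.insert(0, 1)
    # Each part ends on the page before the next part starts.
    ends = [start - 1 for start in starts[1:]] + [len(pages)]
    return list(zip(starts, ends))


if __name__ == "__main__":
    sample_pages = ["Chapter 1 intro", "body", "Chapter 2", "body", "body"]
    print(split_ranges(sample_pages, r"^Chapter \d+"))
```

The same list of start pages could also serve as bookmark destinations, since bookmark generation from page content relies on the same kind of match.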
TETML and XSLT | |
Convert PDF documents to TETML and process TETML with XSLT | |
Simple TETML converter | |
Convert TETML to HTML. | |
Generate input for the Solr enterprise search server. | |
Extract raw text from TETML input. | |
Extract XMP metadata from TETML. | |
Extract a table to CSV file. | |
Create a concordance. | |
List the words in a document that use a particular font in a size larger than a specified value. | |
List font occurrences with page number and position. | |
Compute font and glyph statistics. | |
Create a "back-of-the-book" index. | |
Special | |
Other topics | |
Extract text and images from attachments. | |
Classify the pages in a document according to text or image content. | |
Restrict text extraction to a particular area on the page. | |
Process multiple documents in a loop. | |
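Restricting extraction to an area on the page can be approximated after the fact by filtering words by their bounding boxes. The sketch below assumes each word comes with lower-left and upper-right coordinates in the page coordinate system; the `Word` record and `words_in_area` are hypothetical names, and an extraction library may offer its own option to restrict the area up front.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """A word with its bounding box (lower-left and upper-right corners)."""
    text: str
    llx: float
    lly: float
    urx: float
    ury: float


def words_in_area(words, llx, lly, urx, ury):
    """Keep only the words whose bounding box lies completely inside the rectangle."""
    return [word for word in words
            if word.llx >= llx and word.lly >= lly
            and word.urx <= urx and word.ury <= ury]


if __name__ == "__main__":
    sample_words = [
        Word("Header", 50, 780, 120, 795),
        Word("Body", 50, 400, 90, 412),
    ]
    # Keep everything below y = 700, which drops the header line.
    for word in words_in_area(sample_words, 0, 0, 595, 700):
        print(word.text)
```

Requiring full containment is a design choice; a variant could accept words that merely overlap the rectangle.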