pCOS - PDF Information Retrieval Interface

What is the pCOS Interface?

pCOS provides a simple and elegant facility for retrieving any information from a PDF document which is not part of the page contents (page contents can be extracted with PDFlib TET). The pCOS interface is not a standalone product, but an integrated part of the following products:

PDFlib+PDI (pCOS is not included in the base PDFlib product)
PDFlib Personalization Server (PPS)
Text and Image Extraction Toolkit (PDFlib TET)
PDFlib TET PDF IFilter (pCOS retrieves information from PDF documents for use in indexing and search)
PDFlib PLOP (also includes the pCOS command-line tool in addition to the programming interface)
PDFlib PLOP DS (also includes the pCOS command-line tool in addition to the programming interface)

With pCOS you can extract a variety of interesting items and create output for different purposes. By processing multiple PDF documents with a single call you can easily create summaries of document info entries, page formats, fonts, or any other property. Combined with tabular output this provides a powerful PDF administration tool.

There are many application scenarios for using pCOS within PDF workflows, but you can also use pCOS as a tool for learning or debugging PDF. Here are some typical situations:

Check incoming documents for predefined criteria
Identify problem files in a large collection
Create metadata summaries for document management
Perform quality assurance before publishing documents
Support document retrieval and repository workflows
Summarize the bookmarks
Extract components of PDF documents, e.g. ICC profiles
Check PDFs for security problems (JavaScript etc.)

pCOS Cookbook

The pCOS Cookbook is a collection of programming examples which demonstrate the use of pCOS for various PDF retrieval tasks. The Cookbook includes code, input documents and pCOS output.

pCOS Features

Information Retrieval

With pCOS you can extract a variety of interesting items, such as:

document info fields and XMP metadata
general information: linearization and tagged PDF status, encryption details and permission settings, number of pages and fonts
fonts with name, embedding status, etc.
image data, such as bit depth, color space, compression, XMP
color space details
target URLs and coordinates of Web links
bookmarks and the corresponding page numbers, e.g. to create a table of contents
form field data: full field names, contents, position, etc.
page size, CropBox, page rotation
status of ISO standards: PDF/A, PDF/E, PDF/UA, PDF/VCR, PDF/VT, and PDF/X
geospatial reference information
list or extract file attachments
layer names, page labels, article threads
annotation details
list comments along with the reviewer’s name
digital signature details: name of signature field(s), signed/unsigned, name of signer, PAdES
extract ICC output intent profiles from PDF/X or PDF/A documents
Block properties for PDFlib Personalization Server
JavaScript on document, page, annotation, or field level
retrieve XML invoice data from ZUGFeRD documents
properties of PDF Packages/Portfolios

Supported Input

pCOS supports all flavors of PDF input:

all PDF versions up to Acrobat DC, i.e. PDF 1.7 (ISO 32000-1) up to extension level 8, as well as PDF 2.0 (ISO 32000-2)
encrypted documents (password may be required)
damaged PDF input documents will be repaired if possible

Output Formats

pCOS can create output for different purposes:

plain text output
Unicode text output in UTF-8 or UTF-16 formats
tabular output for processing with a spreadsheet or database
binary data, e.g. ICC profiles or file attachments
user-defined output formats for custom post-processing
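
To illustrate how tabular output might be consumed downstream, here is a minimal Python sketch that turns per-document summaries into comma-separated values. It does not call pCOS itself: the rows, the column names and the `to_csv` helper are hypothetical stand-ins for values that would have been retrieved from PDF documents beforehand.

```python
import csv
import io


def to_csv(rows, columns):
    """Render a list of per-document summaries as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical values; in practice each row would hold properties
# retrieved from one PDF document.
rows = [
    {"file": "a.pdf", "pages": 3, "title": "Report"},
    {"file": "b, final.pdf", "pages": 12, "title": "Manual"},
]

table = to_csv(rows, ["file", "pages", "title"])
print(table)
```

The `csv` module quotes any field that contains the delimiter, so a file name with an embedded comma still ends up in a single spreadsheet cell.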

pCOS Paths - Simple Syntax for PDF Objects

Instead of getting bogged down by complex tree structures, e.g. for bookmarks or form fields, you can easily access PDF objects by using the simple pCOS path syntax. It offers convenient shortcuts for accessing commonly used PDF objects, such as pages, fonts, bookmarks, form fields, etc.
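
As a rough illustration of the path syntax, the following Python sketch lists the fonts in a document together with their embedding status. The paths `length:fonts`, `fonts[i]/name` and `fonts[i]/embedded` are assumed to match the pCOS path reference and should be checked against it. The two query callables and the `FAKE_DOCUMENT` table are hypothetical stand-ins, so the sketch runs without any product installed.

```python
from typing import Callable, List


def font_report(get_number: Callable[[str], float],
                get_string: Callable[[str], str]) -> List[str]:
    """Build one line per font from pCOS-style path queries."""
    count = int(get_number("length:fonts"))
    lines = []
    for i in range(count):
        name = get_string(f"fonts[{i}]/name")
        embedded = get_number(f"fonts[{i}]/embedded") != 0
        status = "embedded" if embedded else "not embedded"
        lines.append(f"{name}: {status}")
    return lines


# Hypothetical stand-in for an opened PDF document: maps path -> value.
# A real application would query the document through a pCOS-enabled product.
FAKE_DOCUMENT = {
    "length:fonts": 2,
    "fonts[0]/name": "Helvetica",
    "fonts[0]/embedded": 0,
    "fonts[1]/name": "CustomSerif",
    "fonts[1]/embedded": 1,
}

report = font_report(lambda path: FAKE_DOCUMENT[path],
                     lambda path: FAKE_DOCUMENT[path])
for line in report:
    print(line)
```

In an actual integration the two callables would wrap the pCOS query functions of the product in use, bound to an opened document. The path strings themselves would stay the same.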

pCOS Programming Interface or Command-Line Tool?

pCOS is available as a programming interface for various development environments, and as a command-line tool for batch operations. Note that the pCOS command-line tool is included only in the PDFlib PLOP and PLOP DS product packages. The programming interface and the command-line tool offer similar features, but are suited to different deployment tasks.

The pCOS Programming Interface is used...

...for integration into desktop or server applications. Simple examples for using the programming interface with all supported language bindings are included in the product packages. Many more examples are available in the pCOS Cookbook.

The pCOS Command-Line Tool is suited...

...for batch processing PDF documents. It doesn’t require any programming, but offers powerful command-line options which can be used to integrate it into complex workflows. The pCOS command-line tool extends the features of the library:

simple retrieval of common PDF elements, such as bookmarks, annotations, metadata, form fields, etc.
extended mode for querying more complex objects and customizing the output format
extract data items, such as file attachments, ICC profiles, etc.
emit information as comma-separated values or a user-defined format for import into a spreadsheet or database
recursion feature for dumping composite PDF objects such as dictionaries and arrays

Evaluation

All products which contain the pCOS interface are available for evaluation. Fully functional evaluation versions including documentation and samples are available in the Download section of our Web site.