PDFlib pCOS – PDF Information Retrieval Tool


What is PDFlib pCOS?

PDFlib pCOS provides a simple and elegant facility for retrieving any information from a PDF document which is not part of the page contents. For example, PDF metadata, interactive elements (links etc.), or page dimensions can easily be queried with pCOS.
With pCOS you can extract a variety of interesting items and create output for different purposes. By processing multiple PDF documents with a single call you can easily create summaries of document info entries, page formats, fonts, or any other property. Combined with tabular output this provides a powerful PDF administration tool.
There are many everyday pCOS applications for PDF practitioners, but you can also use PDFlib pCOS as a tool for learning or debugging PDF. Here are some typical scenarios:

Check incoming documents for predefined criteria

Check PDFs for security problems and active content (JavaScript etc.)

Check documents for quality assurance before publication

Identify problem files in a large collection

Create property summaries for document management

Learn details of PDF data structures

PDFlib pCOS Features

Supported Input

PDFlib pCOS supports all relevant flavors of PDF input:

All PDF versions up to PDF 1.7 (Acrobat 8)

RC4 and AES encryption (password may be required)

Sophisticated security model: even if you don’t know the password, you can query certain pieces of information as long as this doesn’t violate the document author’s intentions

Damaged PDF input documents will be repaired if possible

Information Retrieval

PDFlib pCOS offers a simple query interface, without the need for low-level parser programming. With PDFlib pCOS you can extract a variety of interesting items, such as:

Document info entries and XMP metadata

General information: linearization and tagged PDF status, encryption details and permission settings, number of pages and fonts

All fonts with their name, embedding status, etc.

Images with size, bit depth, color space, compression, etc.

Color space details for all PDF color variations

Target URLs and coordinates of Web links

All bookmarks along with the corresponding page numbers, e.g. to create a table of contents

Form field data: full field names, contents, position, etc.

Page size, CropBox, page rotation

Status of PDF/X and PDF/A compliant files

List or extract file attachments

Layer names, page labels, article threads

Annotation details

List all comments along with the reviewer’s name

Digital signature details: name of signature field(s), signed/unsigned, name of signer, date and reason of signature

Extract ICC output intent profiles from PDF/X or PDF/A files

List PDFlib block properties

JavaScript on document, page, annotation, or field level

Output Formats

PDFlib pCOS can create output for different purposes:

Plain text output

Tabular output for processing with a spreadsheet/database

Binary data for reuse, e.g. ICC profiles or file attachments

Unicode text output in UTF-8 or UTF-16 formats

User-defined output formats for custom post-processing
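As an illustration of how tabular output can feed a spreadsheet or database, the following minimal Python sketch renders per-document property records as comma-separated values. It does not call pCOS: the `records` list contains made-up sample values standing in for query results, and the column names and the `to_csv` helper are hypothetical names chosen for this example.

```python
import csv
import io

# Made-up sample records (assumption): in practice each record would be
# filled with the values retrieved from one PDF document.
records = [
    {"file": "a.pdf", "pages": 12, "pdfversion": "1.4", "encrypted": False},
    {"file": "b.pdf", "pages": 3, "pdfversion": "1.7", "encrypted": True},
]


def to_csv(rows, columns):
    """Render a list of dict rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


summary = to_csv(records, ["file", "pages", "pdfversion", "encrypted"])
print(summary, end="")
```

One row per document and one column per property keeps the summary easy to sort and filter after import.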

pCOS Paths – Simple Syntax for PDF Objects

Instead of getting bogged down by complex tree structures, e.g. for bookmarks or form fields, you can easily access PDF objects by using the simple pCOS path syntax. It offers convenient shortcuts for accessing commonly used PDF objects, such as pages, fonts, bookmarks, form fields etc.
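To make the idea of path-based access concrete, here is a minimal Python sketch of a toy path resolver. It is not the pCOS library and does not read PDF files: `DOC` is a hand-written stand-in for a parsed document, `resolve` is a hypothetical helper, and the path strings (`length:pages`, `pages[1]/width`, `fonts[0]/name`) are illustrative assumptions written in the style described above. The pCOS documentation remains the authority on the actual syntax.

```python
import re

# Toy stand-in for a parsed PDF document (assumption); NOT the pCOS library.
DOC = {
    "pages": [
        {"width": 595.0, "height": 842.0},
        {"width": 612.0, "height": 792.0},
    ],
    "fonts": [
        {"name": "Helvetica", "embedded": False},
    ],
}

# One path step: a name, optionally followed by an array index in brackets.
_STEP = re.compile(r"^([A-Za-z_]\w*)(?:\[(\d+)\])?$")


def resolve(doc, path):
    """Resolve a slash-separated path; a 'length:' prefix returns a count."""
    want_length = path.startswith("length:")
    if want_length:
        path = path[len("length:"):]
    node = doc
    for step in path.split("/"):
        match = _STEP.match(step)
        if match is None:
            raise ValueError(f"malformed path step: {step!r}")
        key, index = match.groups()
        node = node[key]
        if index is not None:
            node = node[int(index)]
    return len(node) if want_length else node


print(resolve(DOC, "length:pages"))
print(resolve(DOC, "pages[1]/width"))
print(resolve(DOC, "fonts[0]/name"))
```

The point of the sketch is that a single string names the object of interest, so the caller never has to walk the underlying tree by hand.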

pCOS Library or Command-Line Tool?

pCOS is available as a programming library (component) for various development environments, and as a command-line tool for batch operations. Both offer similar features, but are suitable for different deployment tasks.

The pCOS programming library is used for integration into desktop or server applications. Examples of using the library with all supported language bindings are included in the pCOS package. A variety of additional examples is available in the pCOS Cookbook on the PDFlib Web site.

The pCOS command-line tool is suited for batch processing PDF documents. It doesn’t require any programming, but offers powerful command-line options which can be used to integrate it into complex workflows. The pCOS command-line tool extends the features of the library:

Simple retrieval of common PDF elements, such as bookmarks, annotations, metadata, form fields, etc.

Extended mode for querying more complex objects and customizing the output format

Extract data items, such as file attachments, ICC profiles, etc.

Emit information as comma-separated values or a user-defined format for import into a spreadsheet or database

Recursion feature for dumping composite PDF objects, such as dictionaries and arrays
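The recursion feature mentioned in the list above can be pictured with a short Python sketch that walks nested dictionaries and arrays and emits one line per leaf value. This is only a conceptual model: `dump` is a hypothetical helper, `sample` is made-up data shaped loosely like PDF objects, and the output format is an assumption rather than what the command-line tool actually prints.

```python
def dump(obj, prefix=""):
    """Return one 'path = value' line for each leaf of nested dicts and lists."""
    lines = []
    if isinstance(obj, dict):
        # Dictionaries contribute a '/key' step to the path.
        for key, value in obj.items():
            lines.extend(dump(value, f"{prefix}/{key}"))
    elif isinstance(obj, list):
        # Arrays contribute an '[index]' step to the path.
        for index, value in enumerate(obj):
            lines.extend(dump(value, f"{prefix}[{index}]"))
    else:
        lines.append(f"{prefix} = {obj!r}")
    return lines


# Made-up composite object (assumption), not taken from a real PDF.
sample = {"Info": {"Title": "Report"}, "Kids": [{"Rotate": 0}, {"Rotate": 90}]}

for line in dump(sample):
    print(line)
```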

Supported Development Environments

PDFlib pCOS is everywhere – it runs on practically all computing platforms. We offer variants for all common flavors of Windows, Mac OS, Linux and Unix.
The pCOS core is written in highly optimized C code for maximum performance and small overhead. Via a simple API (Application Programming Interface) the pCOS functionality is accessible from a variety of development environments:

COM for use with VB, ASP, and many other languages

C and C++

Java, including servlets and Java Application Server

.NET for use with C#, VB.NET, ASP.NET, etc.

Perl

PHP