com.pdflib
Class TET

java.lang.Object
  extended by com.pdflib.TET
All Implemented Interfaces:
IpCOS

public final class TET
extends java.lang.Object
implements IpCOS

Text and Image Extraction Toolkit (TET): Toolkit for extracting Text, Images, and Metadata from PDF Documents.

Note that this is only a syntax summary. For complete information please refer to the TET API reference manual which is available in the "doc" directory of the TET distribution.

Version:
5.2
Author:
Rainer Schaaf

Field Summary
 double alpha
          Direction of inline text progression or direction of the pixel rows.
static int ATTR_ANNOTATION
          Property reported in attributes by get_image_info(int): image extracted from an annotation (appearance stream).
static int ATTR_ARTIFACT
          Property reported in attributes by get_char_info(int) and get_image_info(int): text or image marked as Artifact (irrelevant content) in Tagged PDF.
static int ATTR_DEHYPHENATION_ARTIFACT
          Property reported in attributes by get_char_info(int): hyphenation character, i.e. soft hyphen (unrelated to Tagged PDF Artifact).
static int ATTR_DEHYPHENATION_POST
          Property reported in attributes by get_char_info(int): character after hyphenation.
static int ATTR_DEHYPHENATION_PRE
          Property reported in attributes by get_char_info(int): character before hyphenation.
static int ATTR_DROPCAP
          Property reported in attributes by get_char_info(int): initial large letter.
static int ATTR_NONE
          Property reported in attributes by get_char_info(int): no attribute set.
static int ATTR_PATTERN
          Property reported in attributes by get_image_info(int): image extracted from a pattern.
static int ATTR_SHADOW
          Property reported in attributes by get_char_info(int): shadowed text.
static int ATTR_SOFTMASK
          Property reported in attributes by get_image_info(int): image extracted from a soft mask in a graphics state (defined in a Transparency Group XObject).
static int ATTR_SUB
          Property reported in attributes by get_char_info(int): subscript.
static int ATTR_SUP
          Property reported in attributes by get_char_info(int): superscript.
 int attributes
          Glyph attributes; see ATTR_NONE etc.
 double beta
          Text slanting angle or direction of pixel columns relative to the perpendicular of alpha.
 int colorid
          Color id of the fill and stroke color.
 int colorspaceid
          Colorspace id or -1.
 double[] components
          Color components.
static int CT_INSERTED
          Type reported in type by get_char_info(int): inserted word, line, or paragraph separator.
static int CT_NORMAL
          Type reported in type by get_char_info(int): normal character represented by exactly one glyph.
static int CT_SEQ_CONT
          Type reported in type by get_char_info(int): continuation of a sequence.
static int CT_SEQ_START
          Type reported in type by get_char_info(int): start of a sequence, e.g. ligature.
 int fontid
          Index of the font in the fonts[] pseudo object.
 double fontsize
          Size of the font (always positive).
 double height
          Height of glyph or image.
static int IF_J2K
          Image format returned by write_image_file(int, int, java.lang.String): raw JPEG 2000 code stream, *.j2k.
static int IF_JBIG2
          Image format returned by write_image_file(int, int, java.lang.String): image/x-jbig2, *.jbig2.
static int IF_JP2
          Image format returned by write_image_file(int, int, java.lang.String): image/jp2, *.jp2.
static int IF_JPEG
          Image format returned by write_image_file(int, int, java.lang.String): format image/jpeg, *.jpg.
static int IF_JPF
          Image format returned by write_image_file(int, int, java.lang.String): image/jpx, *.jpf.
static int IF_TIFF
          Image format returned by write_image_file(int, int, java.lang.String): image/tiff, *.tif.
 int imageid
          Index of the image in the pCOS pseudo object images[].
 int patternid
          Pattern id or -1.
 int textrendering
          Text rendering mode; see TR_FILL etc.
static int TR_CLIP
          Text rendering mode reported in textrendering by get_char_info(int): add text to the clipping path.
static int TR_FILL
          Text rendering mode reported in textrendering by get_char_info(int): fill text.
static int TR_FILL_CLIP
          Text rendering mode reported in textrendering by get_char_info(int): fill text and add it to the clipping path.
static int TR_FILLSTROKE
          Text rendering mode reported in textrendering by get_char_info(int): fill and stroke text.
static int TR_FILLSTROKE_CLIP
          Text rendering mode reported in textrendering by get_char_info(int): fill and stroke text and add it to the clipping path.
static int TR_INVISIBLE
          Text rendering mode reported in textrendering by get_char_info(int): invisible text.
static int TR_STROKE
          Text rendering mode reported in textrendering by get_char_info(int): stroke text (outline).
static int TR_STROKE_CLIP
          Text rendering mode reported in textrendering by get_char_info(int): stroke text and add it to the clipping path.
 int type
          Character type; see CT_NORMAL etc.
 boolean unknown
          Indicates whether the glyph could be mapped to Unicode.
 int uv
          UTF-32 Unicode value of the current character.
 double width
          Width of glyph or image.
 double x
          x position of the glyph's or image's reference point.
 double y
          y position of the glyph's or image's reference point.
 
Constructor Summary
TET()
          Create a new TET object.
 
Method Summary
 void close_document(int doc)
          Release a document handle and all internal resources related to that document.
 void close_page(int page)
          Release a page handle and all related resources.
 java.lang.String convert_to_unicode(java.lang.String inputformat, byte[] inputstring, java.lang.String optlist)
          Convert a string in an arbitrary encoding to a Unicode string in various formats.
 void create_pvf(java.lang.String filename, byte[] data, java.lang.String optlist)
          Create a named virtual read-only file from data provided in memory.
 int delete_pvf(java.lang.String filename)
          Delete a named virtual file and free its data structures.
 void delete()
          Delete a TET context and release all its internal resources.
 java.lang.String get_apiname()
          Get the name of the API function which caused an exception or failed.
 int get_char_info(int page)
          Get detailed information for the next character in the most recent text fragment; the results are reported in public fields.
 int get_color_info(int doc, int colorid, java.lang.String keyword)
          Get detailed information for a color id which has been retrieved with get_char_info(int); the results are reported in public fields.
 java.lang.String get_errmsg()
          Get the text of the last thrown exception or the reason for a failed function call.
 int get_errnum()
          Get the number of the last thrown exception or the reason for a failed function call.
 byte[] get_image_data(int doc, int imageid, java.lang.String optlist)
          Write image data to memory.
 int get_image_info(int page)
          Retrieve information about the next image on the page (but not the actual pixel data); the results are reported in public fields.
 byte[] get_tetml(int doc, java.lang.String optlist)
          Retrieve TETML data from memory.
 java.lang.String get_text(int page)
          Get the next text fragment from a page's content.
 double info_pvf(java.lang.String filename, java.lang.String keyword)
          Query properties of a virtual file or the PDFlib Virtual Filesystem (PVF).
 int open_document(java.lang.String filename, java.lang.String optlist)
          Open a disk-based or virtual PDF document for content extraction.
 int open_page(int doc, int pagenumber, java.lang.String optlist)
          Open a page for text extraction.
 void pcos_close_document(int doc, java.lang.String optlist)
          Close an input document via the IpCOS interface.
 double pcos_get_number(int doc, java.lang.String path)
          Get the value of a pCOS path with type number or boolean.
 byte[] pcos_get_stream(int doc, java.lang.String optlist, java.lang.String path)
          Get the contents of a pCOS path with type stream, fstream, or string.
 java.lang.String pcos_get_string(int doc, java.lang.String path)
          Get the value of a pCOS path with type name, number, string, or boolean.
 int pcos_open_document(java.lang.String filename, java.lang.String optlist)
          Open a disk-based or virtual PDF document via the IpCOS interface.
 int process_page(int doc, int pageno, java.lang.String optlist)
          Process a page and create TETML output.
 void set_option(java.lang.String optlist)
          Set one or more global options for TET.
 int write_image_file(int doc, int imageid, java.lang.String optlist)
          Write image data to disk.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

uv

public int uv
UTF-32 Unicode value of the current character.
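
Because uv holds a full UTF-32 code point, values above U+FFFF do not fit into a single Java char. The following is a minimal sketch of turning such a value into a String using only java.lang methods. The class and method names (UvDecoder, decode) are illustrative and not part of the TET API.

```java
class UvDecoder {
    /** Converts a UTF-32 code point, such as the value of the uv field, to a String. */
    static String decode(int uv) {
        // Character.toChars yields one char for BMP code points and a
        // surrogate pair for supplementary code points.
        return new String(Character.toChars(uv));
    }

    public static void main(String[] args) {
        System.out.println(decode(0x41));             // a BMP code point
        System.out.println(decode(0x1F600).length()); // a supplementary code point
    }
}
```

Character.toChars rejects values that are not valid Unicode code points with an IllegalArgumentException, so callers may want to guard against unexpected input.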


type

public int type
Character type; see CT_NORMAL etc. for possible values.


unknown

public boolean unknown
Indicates whether the glyph could be mapped to Unicode.


attributes

public int attributes
Glyph attributes; see ATTR_NONE etc. for possible values.
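
The sketch below shows one way to test attributes against individual properties. It assumes that the ATTR_ values are bit flags which may be combined in a single int; this summary does not state that, so it should be checked against the TET API reference manual. The constant values are placeholders, and real code should use TET.ATTR_SUB, TET.ATTR_SUP etc. instead.

```java
class AttributeFlags {
    // Placeholder bit values for illustration only; the real values of
    // TET.ATTR_SUB, TET.ATTR_SUP etc. are not listed in this summary.
    static final int ATTR_NONE = 0;
    static final int ATTR_SUB = 1;
    static final int ATTR_SUP = 1 << 1;
    static final int ATTR_DROPCAP = 1 << 2;

    /** Returns true if the given flag bit is set (assumes attributes is a bit mask). */
    static boolean has(int attributes, int flag) {
        return (attributes & flag) != 0;
    }

    public static void main(String[] args) {
        int attributes = ATTR_SUP | ATTR_DROPCAP;
        System.out.println(has(attributes, ATTR_SUP));
        System.out.println(has(attributes, ATTR_SUB));
    }
}
```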


x

public double x
x position of the glyph's or image's reference point.

x/y describe the position of the glyph's or image's reference point.

Text: The reference point is the lower left corner of the glyph box for horizontal writing mode, and the top center point for vertical writing mode. For artificial characters the x, y coordinates will be those of the end point of the most recent real character.

Images: The reference point is the lower left corner of the image.


y

public double y
y position of the glyph's or image's reference point.

See Also:
x

width

public double width
Width of glyph or image.

Text: Width of the corresponding glyph (for both horizontal and vertical writing mode). For artificial characters the width will be 0.

Images: Width of the image on the page in points, measured along the image's edges.


height

public double height
Height of glyph or image.

See Also:
width

alpha

public double alpha
Direction of inline text progression or direction of the pixel rows.

Text: Direction of inline text progression in degrees measured counter-clockwise. For horizontal writing mode this is the direction of the text baseline; for vertical writing mode it is the digression from the standard -90° direction. The angle will be in the range -180° < alpha <= +180°. For standard horizontal text as well as for standard text in vertical writing mode the angle will be 0°.

Images: Direction of the pixel rows. The angle will be in the range -180° < alpha <= +180°. For upright images alpha will be 0°.
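
For horizontal writing mode, x and y give the lower left corner of the glyph box and alpha gives the direction of the baseline, so the point where a glyph ends on the baseline follows from basic trigonometry. The sketch below works under that reading only; vertical writing mode is not handled, and the class and method names are illustrative rather than part of the TET API.

```java
class GlyphGeometry {
    /** Returns {x, y} of the baseline end point of a glyph in horizontal writing mode. */
    static double[] endPoint(double x, double y, double width, double alphaDegrees) {
        // alpha is reported in degrees, measured counter-clockwise
        double alpha = Math.toRadians(alphaDegrees);
        return new double[] {
            x + width * Math.cos(alpha),
            y + width * Math.sin(alpha)
        };
    }

    public static void main(String[] args) {
        double[] end = endPoint(100.0, 200.0, 10.0, 0.0);
        System.out.println(end[0] + ", " + end[1]);
    }
}
```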


beta

public double beta
Text slanting angle or direction of pixel columns relative to the perpendicular of alpha.

Text: Text slanting angle in degrees (counter-clockwise), relative to the perpendicular of alpha. The angle will be 0° for upright text, and negative for italicized (slanted) text. The angle will be in the range -180° < beta <= 180°, but different from ±90°. If abs(beta) > 90° the text is mirrored at the baseline.

Images: Direction of the pixel columns, relative to the perpendicular of alpha. The angle will be in the range -180° < beta <= +180°, but different from ±90°. For upright images beta will be in the range -90° < beta < +90°. If abs(beta) > 90° the image will be mirrored at the baseline.
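
The rules for beta can be expressed as two small predicates. This is only a sketch of the conditions described above; the class and method names are made up for illustration.

```java
class BetaCheck {
    /** True if abs(beta) > 90 degrees, i.e. the text or image is mirrored at the baseline. */
    static boolean isMirrored(double betaDegrees) {
        return Math.abs(betaDegrees) > 90.0;
    }

    /** True for unmirrored text with a negative slanting angle, such as italics. */
    static boolean isSlantedLikeItalic(double betaDegrees) {
        return betaDegrees < 0.0 && !isMirrored(betaDegrees);
    }

    public static void main(String[] args) {
        System.out.println(isMirrored(170.0));
        System.out.println(isSlantedLikeItalic(-12.0));
    }
}
```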


imageid

public int imageid
Index of the image in the pCOS pseudo object images[].

Detailed image properties can be retrieved via the entries in this pseudo object.


fontid

public int fontid
Index of the font in the fonts[] pseudo object.

fontid is never negative.


fontsize

public double fontsize
Size of the font (always positive).

The relation of this value to the actual height of glyphs is not fixed, but may vary with the font design. For most fonts the font size is chosen such that it encompasses all ascenders (including accented characters) and descenders.


textrendering

public int textrendering
Text rendering mode; see TR_FILL etc. for possible values.


colorid

public int colorid
Color id of the fill and stroke color.


colorspaceid

public int colorspaceid
Colorspace id or -1.


patternid

public int patternid
Pattern id or -1.


components

public double[] components
Color components.


CT_NORMAL

public static final int CT_NORMAL
Type reported in type by get_char_info(int): normal character represented by exactly one glyph.

See Also:
Constant Field Values

CT_SEQ_START

public static final int CT_SEQ_START
Type reported in type by get_char_info(int): start of a sequence, e.g. ligature.

See Also:
Constant Field Values

CT_SEQ_CONT

public static final int CT_SEQ_CONT
Type reported in type by get_char_info(int): continuation of a sequence.

See Also:
Constant Field Values

CT_INSERTED

public static final int CT_INSERTED
Type reported in type by get_char_info(int): inserted word, line, or paragraph separator.

See Also:
Constant Field Values

ATTR_NONE

public static final int ATTR_NONE
Property reported in attributes by get_char_info(int): no attribute set.

See Also:
Constant Field Values

ATTR_SUB

public static final int ATTR_SUB
Property reported in attributes by get_char_info(int): subscript.

See Also:
Constant Field Values

ATTR_SUP

public static final int ATTR_SUP
Property reported in attributes by get_char_info(int): superscript.

See Also:
Constant Field Values

ATTR_DROPCAP

public static final int ATTR_DROPCAP
Property reported in attributes by get_char_info(int): initial large letter.

See Also:
Constant Field Values

ATTR_SHADOW

public static final int ATTR_SHADOW
Property reported in attributes by get_char_info(int): shadowed text.

See Also:
Constant Field Values

ATTR_DEHYPHENATION_PRE

public static final int ATTR_DEHYPHENATION_PRE
Property reported in attributes by get_char_info(int): character before hyphenation.

See Also:
Constant Field Values

ATTR_DEHYPHENATION_ARTIFACT

public static final int ATTR_DEHYPHENATION_ARTIFACT
Property reported in attributes by get_char_info(int): hyphenation character, i.e. soft hyphen (unrelated to Tagged PDF Artifact).

See Also:
Constant Field Values

ATTR_DEHYPHENATION_POST

public static final int ATTR_DEHYPHENATION_POST
Property reported in attributes by get_char_info(int): character after hyphenation.

See Also:
Constant Field Values

ATTR_ARTIFACT

public static final int ATTR_ARTIFACT
Property reported in attributes by get_char_info(int) and get_image_info(int): text or image marked as Artifact (irrelevant content) in Tagged PDF.

See Also:
Constant Field Values

ATTR_ANNOTATION

public static final int ATTR_ANNOTATION
Property reported in attributes by get_image_info(int): image extracted from an annotation (appearance stream).

See Also:
Constant Field Values

ATTR_PATTERN

public static final int ATTR_PATTERN
Property reported in attributes by get_image_info(int): image extracted from a pattern.

See Also:
Constant Field Values

ATTR_SOFTMASK

public static final int ATTR_SOFTMASK
Property reported in attributes by get_image_info(int): image extracted from a soft mask in a graphics state (defined in a Transparency Group XObject).

See Also:
Constant Field Values

TR_FILL

public static final int TR_FILL
Text rendering mode reported in textrendering by get_char_info(int): fill text.

See Also:
Constant Field Values

TR_STROKE

public static final int TR_STROKE
Text rendering mode reported in textrendering by get_char_info(int): stroke text (outline).

See Also:
Constant Field Values

TR_FILLSTROKE

public static final int TR_FILLSTROKE
Text rendering mode reported in textrendering by get_char_info(int): fill and stroke text.

See Also:
Constant Field Values

TR_INVISIBLE

public static final int TR_INVISIBLE
Text rendering mode reported in textrendering by get_char_info(int): invisible text.

See Also:
Constant Field Values

TR_FILL_CLIP

public static final int TR_FILL_CLIP
Text rendering mode reported in textrendering by get_char_info(int): fill text and add it to the clipping path.

See Also:
Constant Field Values

TR_STROKE_CLIP

public static final int TR_STROKE_CLIP
Text rendering mode reported in textrendering by get_char_info(int): stroke text and add it to the clipping path.

See Also:
Constant Field Values

TR_FILLSTROKE_CLIP

public static final int TR_FILLSTROKE_CLIP
Text rendering mode reported in textrendering by get_char_info(int): fill and stroke text and add it to the clipping path.

See Also:
Constant Field Values

TR_CLIP

public static final int TR_CLIP
Text rendering mode reported in textrendering by get_char_info(int): add text to the clipping path.

See Also:
Constant Field Values
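
The eight text rendering modes fall into groups: modes that paint glyphs, and modes that add text to the clipping path. The sketch below classifies a mode using the mode numbers 0 to 7 that the PDF specification defines for the Tr operator. Whether the TR_ constants carry the same numeric values is an assumption not confirmed by this summary, so real code should compare against TET.TR_FILL etc. rather than literals. All names in the block are illustrative.

```java
class RenderingModes {
    // Mode numbers of the PDF Tr operator; assumed, not confirmed, to match TET.TR_FILL etc.
    static final int FILL = 0, STROKE = 1, FILLSTROKE = 2, INVISIBLE = 3,
            FILL_CLIP = 4, STROKE_CLIP = 5, FILLSTROKE_CLIP = 6, CLIP = 7;

    private static void check(int mode) {
        if (mode < FILL || mode > CLIP) {
            throw new IllegalArgumentException("not a text rendering mode: " + mode);
        }
    }

    /** True if the mode fills and/or strokes the glyphs. */
    static boolean paintsGlyphs(int mode) {
        check(mode);
        return mode != INVISIBLE && mode != CLIP;
    }

    /** True if the mode adds the text to the clipping path. */
    static boolean addsToClippingPath(int mode) {
        check(mode);
        return mode >= FILL_CLIP;
    }

    public static void main(String[] args) {
        System.out.println(paintsGlyphs(INVISIBLE));
        System.out.println(addsToClippingPath(STROKE_CLIP));
    }
}
```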

IF_TIFF

public static final int IF_TIFF
Image format returned by write_image_file(int, int, java.lang.String): image/tiff, *.tif.

See Also:
Constant Field Values

IF_JPEG

public static final int IF_JPEG
Image format returned by write_image_file(int, int, java.lang.String): format image/jpeg, *.jpg.

See Also:
Constant Field Values

IF_JP2

public static final int IF_JP2
Image format returned by write_image_file(int, int, java.lang.String): image/jp2, *.jp2.

See Also:
Constant Field Values

IF_JPF

public static final int IF_JPF
Image format returned by write_image_file(int, int, java.lang.String): image/jpx, *.jpf.

See Also:
Constant Field Values

IF_J2K

public static final int IF_J2K
Image format returned by write_image_file(int, int, java.lang.String): raw JPEG 2000 code stream, *.j2k.

See Also:
Constant Field Values

IF_JBIG2

public static final int IF_JBIG2
Image format returned by write_image_file(int, int, java.lang.String): image/x-jbig2, *.jbig2.

See Also:
Constant Field Values

Constructor Detail

TET

public TET()
    throws TETException
Create a new TET object.

Throws:
TETException - May throw an exception in case of memory shortage.

Method Detail

close_document

public final void close_document(int doc)
                          throws TETException
Release a document handle and all internal resources related to that document.

Parameters:
doc - A valid document handle obtained with open_document().
Throws:
TETException - TET output cannot be finished after an exception.

close_page

public final void close_page(int page)
                      throws TETException
Release a page handle and all related resources.

Parameters:
page - A valid page handle obtained with open_page().
Throws:
TETException - TET output cannot be finished after an exception.

convert_to_unicode

public final java.lang.String convert_to_unicode(java.lang.String inputformat,
                                                 byte[] inputstring,
                                                 java.lang.String optlist)
                                          throws TETException
Convert a string in an arbitrary encoding to a Unicode string in various formats.

Specified by:
convert_to_unicode in interface IpCOS
Parameters:
inputformat - inputformat
inputstring - inputstring
optlist - optlist
Returns:
The converted Unicode string.
Throws:
TETException - TET output cannot be finished after an exception.

create_pvf

public final void create_pvf(java.lang.String filename,
                             byte[] data,
                             java.lang.String optlist)
                      throws TETException
Create a named virtual read-only file from data provided in memory.

Specified by:
create_pvf in interface IpCOS
Parameters:
filename - filename
data - data
optlist - optlist
Throws:
TETException - TET output cannot be finished after an exception.

delete_pvf

public final int delete_pvf(java.lang.String filename)
                     throws TETException
Delete a named virtual file and free its data structures.

Specified by:
delete_pvf in interface IpCOS
Parameters:
filename - filename
Returns:
-1 if the virtual file exists but is locked, and 1 otherwise.
Throws:
TETException - TET output cannot be finished after an exception.

get_apiname

public final java.lang.String get_apiname()
Get the name of the API function which caused an exception or failed.

Specified by:
get_apiname in interface IpCOS
Returns:
Name of an API function.

get_errmsg

public final java.lang.String get_errmsg()
Get the text of the last thrown exception or the reason for a failed function call.

Specified by:
get_errmsg in interface IpCOS
Returns:
Text containing the description of the most recent error condition.

get_errnum

public final int get_errnum()
Get the number of the last thrown exception or the reason for a failed function call.

Specified by:
get_errnum in interface IpCOS
Returns:
Error number of the most recent error condition.
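
The three accessors get_errnum(), get_apiname() and get_errmsg() are typically combined into a single diagnostic line. The sketch below does this against a small stand-in interface so that it runs without the TET library. ErrorSource, fixed and format are hypothetical names, and the sample number and message are made up.

```java
class ErrorReport {
    /** Stand-in for the three error accessors, with the signatures listed above. */
    interface ErrorSource {
        int get_errnum();
        String get_apiname();
        String get_errmsg();
    }

    /** Builds a stand-in error source with fixed values (for demonstration only). */
    static ErrorSource fixed(final int errnum, final String apiname, final String errmsg) {
        return new ErrorSource() {
            public int get_errnum() { return errnum; }
            public String get_apiname() { return apiname; }
            public String get_errmsg() { return errmsg; }
        };
    }

    /** Combines number, API name and message into one line. */
    static String format(ErrorSource source) {
        return "Error " + source.get_errnum() + " in " + source.get_apiname()
                + "(): " + source.get_errmsg();
    }

    public static void main(String[] args) {
        // The number and message are invented sample values.
        System.out.println(format(fixed(1234, "open_document", "sample message")));
    }
}
```

In real code the same formatting could be applied to a TET object after a call such as open_document() returned -1.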

get_image_data

public final byte[] get_image_data(int doc,
                                   int imageid,
                                   java.lang.String optlist)
                            throws TETException
Write image data to memory.

Parameters:
doc - A valid document handle obtained with open_document().
imageid - imageid
optlist - optlist
Returns:
Data representing the image according to the specified options.
Throws:
TETException - TET output cannot be finished after an exception.

get_text

public final java.lang.String get_text(int page)
                                throws TETException
Get the next text fragment from a page's content.

Parameters:
page - A valid page handle obtained with open_page().
Returns:
A string containing the next text fragment on the page.
Throws:
TETException - TET output cannot be finished after an exception.

info_pvf

public final double info_pvf(java.lang.String filename,
                             java.lang.String keyword)
                      throws TETException
Query properties of a virtual file or the PDFlib Virtual Filesystem (PVF).

Specified by:
info_pvf in interface IpCOS
Parameters:
filename - filename
keyword - keyword
Returns:
The value of some file parameter as requested by keyword.
Throws:
TETException - TET output cannot be finished after an exception.

open_document

public final int open_document(java.lang.String filename,
                               java.lang.String optlist)
                        throws TETException
Open a disk-based or virtual PDF document for content extraction.

Parameters:
filename - filename
optlist - optlist
Returns:
-1 on error, or a document handle otherwise.
Throws:
TETException - TET output cannot be finished after an exception.

open_page

public final int open_page(int doc,
                           int pagenumber,
                           java.lang.String optlist)
                    throws TETException
Open a page for text extraction.

Parameters:
doc - A valid document handle obtained with open_document().
pagenumber - pagenumber
optlist - optlist
Returns:
A handle for the page, or -1 in case of an error.
Throws:
TETException - TET output cannot be finished after an exception.

pcos_get_number

public final double pcos_get_number(int doc,
                                    java.lang.String path)
                             throws TETException
Get the value of a pCOS path with type number or boolean.

Specified by:
pcos_get_number in interface IpCOS
Parameters:
doc - doc
path - path
Returns:
The numerical value of the object identified by the pCOS path.
Throws:
TETException - TET output cannot be finished after an exception.

pcos_get_string

public final java.lang.String pcos_get_string(int doc,
                                              java.lang.String path)
                                       throws TETException
Get the value of a pCOS path with type name, number, string, or boolean.

Specified by:
pcos_get_string in interface IpCOS
Parameters:
doc - doc
path - path
Returns:
A string with the value of the object identified by the pCOS path.
Throws:
TETException - TET output cannot be finished after an exception.

pcos_get_stream

public final byte[] pcos_get_stream(int doc,
                                    java.lang.String optlist,
                                    java.lang.String path)
                             throws TETException
Get the contents of a pCOS path with type stream, fstream, or string.

Specified by:
pcos_get_stream in interface IpCOS
Parameters:
doc - doc
optlist - optlist
path - path
Returns:
The unencrypted data contained in the stream or string.
Throws:
TETException - TET output cannot be finished after an exception.

set_option

public final void set_option(java.lang.String optlist)
                      throws TETException
Set one or more global options for TET.

Specified by:
set_option in interface IpCOS
Parameters:
optlist - optlist
Throws:
TETException - TET output cannot be finished after an exception.

write_image_file

public final int write_image_file(int doc,
                                  int imageid,
                                  java.lang.String optlist)
                           throws TETException
Write image data to disk.

Parameters:
doc - A valid document handle obtained with open_document().
imageid - imageid
optlist - optlist
Returns:
-1 on error, or the image format otherwise (see IF_TIFF etc.)
Throws:
TETException - TET output cannot be finished after an exception.
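
Since write_image_file() returns either -1 or one of the IF_ constants, the result can be mapped to the file suffix named in the field descriptions. The sketch below uses placeholder constant values because the real values are not listed in this summary; real code should use TET.IF_TIFF etc. The class and method names are illustrative.

```java
class ImageFormatNames {
    // Hypothetical stand-ins for TET.IF_TIFF etc.; the real values are not listed here.
    static final int IF_TIFF = 1, IF_JPEG = 2, IF_JP2 = 3,
            IF_JPF = 4, IF_J2K = 5, IF_JBIG2 = 6;

    /** Maps a write_image_file() result to the matching file suffix. */
    static String suffixFor(int result) {
        if (result == -1) {
            throw new IllegalStateException("write_image_file reported an error");
        }
        switch (result) {
            case IF_TIFF:  return ".tif";
            case IF_JPEG:  return ".jpg";
            case IF_JP2:   return ".jp2";
            case IF_JPF:   return ".jpf";
            case IF_J2K:   return ".j2k";
            case IF_JBIG2: return ".jbig2";
            default:
                throw new IllegalArgumentException("unknown image format: " + result);
        }
    }

    public static void main(String[] args) {
        System.out.println(suffixFor(IF_JPEG));
    }
}
```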

process_page

public final int process_page(int doc,
                              int pageno,
                              java.lang.String optlist)
                       throws TETException
Process a page and create TETML output.

Parameters:
doc - A valid document handle obtained with open_document().
pageno - pageno
optlist - optlist
Returns:
Always 1. PDF problems are reported in a TETML Exception element.
Throws:
TETException - TET output cannot be finished after an exception.

get_tetml

public final byte[] get_tetml(int doc,
                              java.lang.String optlist)
                       throws TETException
Retrieve TETML data from memory.

Parameters:
doc - A valid document handle obtained with open_document().
optlist - optlist
Returns:
A byte array containing the next chunk of TETML data.
Throws:
TETException - TET output cannot be finished after an exception.

delete

public final void delete()
Delete a TET context and release all its internal resources. This should be called for cleanup when processing is done, and after a TETException occurred. This method may also be called by the finalizer, but it is safe to issue multiple calls.

Specified by:
delete in interface IpCOS
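
The cleanup rule for delete() maps naturally onto a try/finally block. The sketch below models it with a stand-in class so that it runs without the TET library; FakeContext and process are hypothetical names, and only the delete() behavior described above (safe to call more than once) is modelled.

```java
class CleanupSketch {
    /** Stand-in for a TET context; only delete() is modelled. */
    static class FakeContext {
        boolean deleted = false;
        int releases = 0;

        void delete() {
            if (!deleted) {   // repeated calls are harmless
                deleted = true;
                releases++;
            }
        }
    }

    /** Runs the work and always deletes the context afterwards, even after an exception. */
    static void process(FakeContext context, Runnable work) {
        try {
            work.run();
        } finally {
            context.delete();
        }
    }

    public static void main(String[] args) {
        FakeContext context = new FakeContext();
        process(context, () -> System.out.println("extracting..."));
        System.out.println(context.deleted);
    }
}
```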

get_char_info

public final int get_char_info(int page)
                        throws TETException
Get detailed information for the next character in the most recent text fragment; the results are reported in public fields.

Parameters:
page - A valid page handle obtained with open_page().
Returns:
Binding-specific error or success code.
Throws:
TETException - May throw an exception for various reasons.

get_color_info

public final int get_color_info(int doc,
                                int colorid,
                                java.lang.String keyword)
                         throws TETException
Get detailed information for a color id which has been retrieved with get_char_info(int); the results are reported in public fields.

Parameters:
doc - A valid document handle obtained with open_document().
colorid - colorid
keyword - keyword
Returns:
Details about the requested color space and color.
Throws:
TETException - May throw an exception for various reasons.

get_image_info

public final int get_image_info(int page)
                         throws TETException
Retrieve information about the next image on the page (but not the actual pixel data); the results are reported in public fields.

Parameters:
page - page
Returns:
Details about the next image on the page.
Throws:
TETException - May throw an exception for various reasons.

pcos_open_document

public int pcos_open_document(java.lang.String filename,
                              java.lang.String optlist)
                       throws java.lang.Exception
Open a disk-based or virtual PDF document via the IpCOS interface.

Specified by:
pcos_open_document in interface IpCOS
Parameters:
filename - The full path name of the PDF file to be opened. The file will be searched by means of the SearchPath resource.
optlist - An option list specifying document options.
Returns:
A document handle.
Throws:
java.lang.Exception - see manual

pcos_close_document

public void pcos_close_document(int doc,
                                java.lang.String optlist)
                         throws java.lang.Exception
Close an input document via the IpCOS interface.

Specified by:
pcos_close_document in interface IpCOS
Parameters:
doc - A valid document handle obtained with pcos_open_document().
optlist - An option list specifying document options.
Throws:
java.lang.Exception - see manual