com.pdflib
Class TET

java.lang.Object
  extended by com.pdflib.TET
All Implemented Interfaces:
IpCOS

public final class TET
extends java.lang.Object
implements IpCOS

Text and Image Extraction Toolkit (TET): Toolkit for extracting Text, Images, and Metadata from PDF Documents.

Note that this is only a syntax summary. For complete information please refer to the TET API reference manual which is available in the "doc" directory of the TET distribution.

Version:
5.1
Author:
Rainer Schaaf

Field Summary
 double alpha
          Direction of inline text progression or direction of the pixel rows.
 int attributes
          Glyph attributes expressed as bits.
 double beta
          Text slanting angle or direction of pixel columns relative to the perpendicular of alpha.
 int colorid
          Unique text color id.
 int colorspaceid
          Colorspace id or -1.
 double[] components
          Color components.
 int fontid
          Index of the font in the fonts[] pseudo object.
 double fontsize
          Size of the font (always positive).
 double height
          Height of glyph or image.
 int imageid
          Index of the image in the pCOS pseudo object images[].
 int patternid
          Pattern id or -1.
 int textrendering
          Text rendering mode.
 int type
          Type of the character.
 boolean unknown
          Indicates whether the glyph could be mapped to Unicode.
 int uv
          UTF-32 Unicode value of the current character.
 double width
          Width of glyph or image.
 double x
          x position of the glyph's or image's reference point.
 double y
          y position of the glyph's or image's reference point.
 
Constructor Summary
TET()
          Create a new TET object.
 
Method Summary
 void close_document(int doc)
          Release a document handle and all internal resources related to that document.
 void close_page(int page)
          Release a page handle and all related resources.
 java.lang.String convert_to_unicode(java.lang.String inputformat, byte[] inputstring, java.lang.String optlist)
          Convert a string in an arbitrary encoding to a Unicode string in various formats.
 void create_pvf(java.lang.String filename, byte[] data, java.lang.String optlist)
          Create a named virtual read-only file from data provided in memory.
 int delete_pvf(java.lang.String filename)
          Delete a named virtual file and free its data structures.
 void delete()
          Delete a TET context and release all its internal resources.
 java.lang.String get_apiname()
          Get the name of the API function which caused an exception or failed.
 int get_char_info(int page)
          Get detailed information for the next character in the most recent text fragment.
 int get_color_info(int doc, int colorid, java.lang.String keyword)
          Get detailed information for a color id which has been retrieved with TET_get_char_info.
 java.lang.String get_errmsg()
          Get the text of the last thrown exception or the reason for a failed function call.
 int get_errnum()
          Get the number of the last thrown exception or the reason for a failed function call.
 byte[] get_image_data(int doc, int imageid, java.lang.String optlist)
          Write image data to memory.
 int get_image_info(int page)
          Retrieve information about the next image on the page (but not the actual pixel data).
 byte[] get_tetml(int doc, java.lang.String optlist)
          Retrieve TETML data from memory.
 java.lang.String get_text(int page)
          Get the next text fragment from a page's content.
 byte[] get_xml_data(int doc, java.lang.String optlist)
          Deprecated. Use TET_get_tetml().
 double info_pvf(java.lang.String filename, java.lang.String keyword)
          Query properties of a virtual file or the PDFlib Virtual Filesystem (PVF).
 int open_document(java.lang.String filename, java.lang.String optlist)
          Open a disk-based or virtual PDF document for content extraction.
 int open_page(int doc, int pagenumber, java.lang.String optlist)
          Open a page for text extraction.
 void pcos_close_document(int doc, java.lang.String optlist)
          Close TET input document via the IpCOS interface.
 double pcos_get_number(int doc, java.lang.String path)
          Get the value of a pCOS path with type number or boolean.
 byte[] pcos_get_stream(int doc, java.lang.String optlist, java.lang.String path)
          Get the contents of a pCOS path with type stream, fstream, or string.
 java.lang.String pcos_get_string(int doc, java.lang.String path)
          Get the value of a pCOS path with type name, number, string, or boolean.
 int pcos_open_document(java.lang.String filename, java.lang.String optlist)
          Open TET input document via the IpCOS interface.
 int process_page(int doc, int pageno, java.lang.String optlist)
          Process a page and create TETML output.
 void set_option(java.lang.String optlist)
          Set one or more global options for TET.
 int write_image_file(int doc, int imageid, java.lang.String optlist)
          Write image data to disk.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

uv

public int uv
UTF-32 Unicode value of the current character.

It will be 0 if the corresponding UTF-16 value is the trailing value of a surrogate pair (i.e. if type=11).
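
Because uv holds a full UTF-32 value while Java strings are UTF-16, code that collects uv values has to expand characters outside the Basic Multilingual Plane itself. The following is a minimal sketch, assuming the uv values have already been gathered into an int array; the class and method names are illustrative and not part of TET. Entries equal to 0 are skipped, following the trailing-surrogate rule described above.

```java
final class UvText {
    /**
     * Builds a Java String from UTF-32 values such as those reported in uv.
     * A value of 0 is skipped because it marks the trailing half of a
     * surrogate pair, which the preceding value already covers.
     */
    static String fromCodePoints(int[] uvValues) {
        StringBuilder sb = new StringBuilder();
        for (int uv : uvValues) {
            if (uv == 0) {
                continue; // trailing surrogate slot, nothing to append
            }
            sb.appendCodePoint(uv); // emits one or two UTF-16 units as needed
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D11E needs two UTF-16 units, so a 0 entry follows it.
        int[] uvValues = {0x41, 0x1D11E, 0, 0x42};
        String s = fromCodePoints(uvValues);
        System.out.println(s.length());                      // UTF-16 units
        System.out.println(s.codePointCount(0, s.length())); // Unicode characters
    }
}
```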


type

public int type
Type of the character.

The following types describe real characters which correspond to a glyph on the page. The values of all other properties/fields are determined by the corresponding glyph:

The following types describe artificial characters which do not correspond to a glyph on the page. The x and y fields will specify the most recent real character's endpoint, the width field will be 0, and all other fields except uv will contain the values corresponding to the most recent real character:


unknown

public boolean unknown
Indicates whether the glyph could be mapped to Unicode.

Usually false, but will be true if the original glyph could not be mapped to Unicode and has been replaced with the character specified as unknownchar.


attributes

public int attributes
Glyph attributes expressed as bits.

The bits can be combined:
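
The bit assignments themselves are listed in the TET API reference manual and are not repeated in this summary. As a rough illustration of how a combined bit field is tested, the sketch below uses two hypothetical mask constants; their names and positions are placeholders and do not reflect the real TET attribute bits.

```java
final class GlyphAttributes {
    // Hypothetical masks for illustration only. The real bit assignments
    // are documented in the TET API reference manual.
    static final int EXAMPLE_BIT_0 = 1 << 0;
    static final int EXAMPLE_BIT_1 = 1 << 1;

    /** Returns true if every bit of mask is set in attributes. */
    static boolean has(int attributes, int mask) {
        return (attributes & mask) == mask;
    }

    public static void main(String[] args) {
        int attributes = EXAMPLE_BIT_0 | EXAMPLE_BIT_1; // two attributes combined
        System.out.println(has(attributes, EXAMPLE_BIT_0));
        System.out.println(has(EXAMPLE_BIT_1, EXAMPLE_BIT_0));
    }
}
```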


x

public double x
x position of the glyph's or image's reference point.

x/y describe the position of the glyph's or image's reference point.

Text: The reference point is the lower left corner of the glyph box for horizontal writing mode, and the top center point for vertical writing mode. For artificial characters the x, y coordinates will be those of the end point of the most recent real character.

Images: The reference point is the lower left corner of the image.


y

public double y
y position of the glyph's or image's reference point.

See Also:
x

width

public double width
Width of glyph or image.

Text: Width of the corresponding glyph (for both horizontal and vertical writing mode). For artificial characters the width will be 0.

Images: Width of the image on the page in points, measured along the image's edges.


height

public double height
Height of glyph or image.

See Also:
width

alpha

public double alpha
Direction of inline text progression or direction of the pixel rows.

Text: Direction of inline text progression in degrees, measured counter-clockwise. For horizontal writing mode this is the direction of the text baseline; for vertical writing mode it is the deviation from the standard -90° direction. The angle will be in the range -180° < alpha <= +180°. For standard horizontal text as well as for standard text in vertical writing mode the angle will be 0°.

Images: Direction of the pixel rows. The angle will be in the range -180° < alpha <= +180°. For upright images alpha will be 0°.
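
The x, y, width and alpha fields can be combined with basic trigonometry. The sketch below is a geometric illustration, not TET code: it assumes horizontal writing mode and assumes that the end point of a glyph lies at distance width from the reference point in the direction alpha. The helper names are made up for this example. A second helper shows one way to bring an arbitrary angle into the documented range -180° < alpha <= +180°.

```java
final class GlyphGeometry {
    /** Maps any angle in degrees into the range -180 < angle <= +180. */
    static double normalizeDegrees(double angle) {
        double a = angle % 360.0; // now strictly between -360 and +360
        if (a > 180.0) {
            a -= 360.0;
        } else if (a <= -180.0) {
            a += 360.0;
        }
        return a;
    }

    /**
     * Assumed end point of a glyph in horizontal writing mode:
     * the reference point (x, y) moved by width along direction alpha.
     */
    static double[] endPoint(double x, double y, double width, double alphaDegrees) {
        double rad = Math.toRadians(alphaDegrees);
        return new double[] {x + width * Math.cos(rad), y + width * Math.sin(rad)};
    }

    public static void main(String[] args) {
        double[] p = endPoint(10.0, 20.0, 5.0, 0.0);
        System.out.println(p[0] + ", " + p[1]);
        System.out.println(normalizeDegrees(270.0));
    }
}
```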


beta

public double beta
Text slanting angle or direction of pixel columns relative to the perpendicular of alpha.

Text: Text slanting angle in degrees (counter-clockwise), relative to the perpendicular of alpha. The angle will be 0° for upright text, and negative for italicized (slanted) text. The angle will be in the range -180° < beta <= 180°, but different from ±90°. If abs(beta) > 90° the text is mirrored at the baseline.

Images: Direction of the pixel columns, relative to the perpendicular of alpha. The angle will be in the range -180° < beta <= +180°, but different from ±90°. For upright images beta will be in the range -90° < beta < +90°. If abs(beta) > 90° the image will be mirrored at the baseline.
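
The mirroring rule stated above (abs(beta) > 90°) translates directly into a comparison. The helper names below are illustrative only; the tolerance parameter is an assumption added for this sketch, since floating-point angles are rarely exactly 0.

```java
final class SlantCheck {
    /** True if the text or image is mirrored at the baseline, i.e. abs(beta) > 90 degrees. */
    static boolean isMirrored(double betaDegrees) {
        return Math.abs(betaDegrees) > 90.0;
    }

    /** True if beta is within toleranceDegrees of 0, i.e. the text is upright. */
    static boolean isUpright(double betaDegrees, double toleranceDegrees) {
        return Math.abs(betaDegrees) <= toleranceDegrees;
    }

    public static void main(String[] args) {
        System.out.println(isMirrored(170.0));     // beyond 90 degrees
        System.out.println(isUpright(-12.0, 0.1)); // typical italic slant
    }
}
```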


imageid

public int imageid
Index of the image in the pCOS pseudo object images[].

Detailed image properties can be retrieved via the entries in this pseudo object.


fontid

public int fontid
Index of the font in the fonts[] pseudo object.

fontid is never negative.


fontsize

public double fontsize
Size of the font (always positive).

The relation of this value to the actual height of glyphs is not fixed, but may vary with the font design. For most fonts the font size is chosen such that it encompasses all ascenders (including accented characters) and descenders.


textrendering

public int textrendering
Text rendering mode.


colorid

public int colorid
Unique text color id.


colorspaceid

public int colorspaceid
Colorspace id or -1.


patternid

public int patternid
Pattern id or -1.


components

public double[] components
Color components.

Constructor Detail

TET

public TET()
    throws TETException
Create a new TET object.

Throws:
TETException - May throw an exception in case of memory shortage.
Method Detail

close_document

public final void close_document(int doc)
                          throws TETException
Release a document handle and all internal resources related to that document.

Throws:
TETException - TET output cannot be finished after an exception.

close_page

public final void close_page(int page)
                      throws TETException
Release a page handle and all related resources.

Throws:
TETException - TET output cannot be finished after an exception.

convert_to_unicode

public final java.lang.String convert_to_unicode(java.lang.String inputformat,
                                                 byte[] inputstring,
                                                 java.lang.String optlist)
                                          throws TETException
Convert a string in an arbitrary encoding to a Unicode string in various formats.

Specified by:
convert_to_unicode in interface IpCOS
Returns:
The converted Unicode string.
Throws:
TETException - TET output cannot be finished after an exception.

create_pvf

public final void create_pvf(java.lang.String filename,
                             byte[] data,
                             java.lang.String optlist)
                      throws TETException
Create a named virtual read-only file from data provided in memory.

Specified by:
create_pvf in interface IpCOS
Throws:
TETException - TET output cannot be finished after an exception.

delete_pvf

public final int delete_pvf(java.lang.String filename)
                     throws TETException
Delete a named virtual file and free its data structures.

Specified by:
delete_pvf in interface IpCOS
Returns:
-1 if the virtual file exists but is locked, and 1 otherwise.
Throws:
TETException - TET output cannot be finished after an exception.

get_apiname

public final java.lang.String get_apiname()
Get the name of the API function which caused an exception or failed.

Specified by:
get_apiname in interface IpCOS
Returns:
Name of an API function.

get_errmsg

public final java.lang.String get_errmsg()
Get the text of the last thrown exception or the reason for a failed function call.

Specified by:
get_errmsg in interface IpCOS
Returns:
Text containing the description of the most recent error condition.

get_errnum

public final int get_errnum()
Get the number of the last thrown exception or the reason for a failed function call.

Specified by:
get_errnum in interface IpCOS
Returns:
Error number of the most recent error condition.

get_image_data

public final byte[] get_image_data(int doc,
                                   int imageid,
                                   java.lang.String optlist)
                            throws TETException
Write image data to memory.

Returns:
Data representing the image according to the specified options.
Throws:
TETException - TET output cannot be finished after an exception.

get_text

public final java.lang.String get_text(int page)
                                throws TETException
Get the next text fragment from a page's content.

Throws:
TETException - TET output cannot be finished after an exception.
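
Since get_text() delivers the page content one fragment at a time, callers usually invoke it in a loop. The sketch below shows such a loop against a small stand-in interface, so that it runs without the TET library. PageTextSource and FakeTextSource are hypothetical and only copy the method signatures from this summary (without the TETException declarations). The loop assumes that get_text() returns null once no more text is available; this summary does not state the end condition, so check the TET API reference manual before relying on it.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/** Hypothetical stand-in for three TET methods; signatures follow this summary. */
interface PageTextSource {
    int open_page(int doc, int pagenumber, String optlist);
    String get_text(int page);
    void close_page(int page);
}

/** In-memory fake so that the loop can run without the TET library. */
final class FakeTextSource implements PageTextSource {
    private final Deque<String> fragments;
    boolean closed = false;

    FakeTextSource(String... fragments) {
        this.fragments = new ArrayDeque<>(Arrays.asList(fragments));
    }

    public int open_page(int doc, int pagenumber, String optlist) { return 0; }
    public String get_text(int page) { return fragments.poll(); } // null when empty
    public void close_page(int page) { closed = true; }
}

final class TextLoop {
    /** Returns the concatenated fragments of one page, or null if the page could not be opened. */
    static String extractPage(PageTextSource tet, int doc, int pagenumber, String optlist) {
        int page = tet.open_page(doc, pagenumber, optlist);
        if (page == -1) {
            return null; // open_page() reports errors with -1
        }
        StringBuilder text = new StringBuilder();
        try {
            String fragment;
            // Assumption: get_text() returns null once no more text is available.
            while ((fragment = tet.get_text(page)) != null) {
                text.append(fragment);
            }
        } finally {
            tet.close_page(page); // release the page handle even if get_text() throws
        }
        return text.toString();
    }

    public static void main(String[] args) {
        FakeTextSource fake = new FakeTextSource("Hello ", "world");
        System.out.println(extractPage(fake, 0, 1, ""));
    }
}
```

With the real class, the same loop would be written against a TET instance, and the TETException declared by each method would have to be caught or propagated.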

info_pvf

public final double info_pvf(java.lang.String filename,
                             java.lang.String keyword)
                      throws TETException
Query properties of a virtual file or the PDFlib Virtual Filesystem (PVF).

Specified by:
info_pvf in interface IpCOS
Returns:
The value of some file parameter as requested by keyword.
Throws:
TETException - TET output cannot be finished after an exception.

open_document

public final int open_document(java.lang.String filename,
                               java.lang.String optlist)
                        throws TETException
Open a disk-based or virtual PDF document for content extraction.

Returns:
-1 on error, or a document handle otherwise.
Throws:
TETException - TET output cannot be finished after an exception.
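
open_document() signals failure through its return value rather than only through an exception, so the handle should be compared with -1 before it is used. The sketch below combines that check with get_errnum(), get_apiname() and get_errmsg() to build a one-line report. ErrorQueries and FixedError are hypothetical stand-ins that mirror the signatures in this summary, and the error number and message in the example are invented.

```java
/** Hypothetical stand-in for the three error query methods; signatures follow this summary. */
interface ErrorQueries {
    int get_errnum();
    String get_apiname();
    String get_errmsg();
}

/** Fixed values for demonstration; the number and message are made up. */
final class FixedError implements ErrorQueries {
    private final int errnum;
    private final String apiname;
    private final String errmsg;

    FixedError(int errnum, String apiname, String errmsg) {
        this.errnum = errnum;
        this.apiname = apiname;
        this.errmsg = errmsg;
    }

    public int get_errnum() { return errnum; }
    public String get_apiname() { return apiname; }
    public String get_errmsg() { return errmsg; }
}

final class ErrorReport {
    static final int FAILED = -1; // value returned by open_document() and open_page() on error

    /** Formats the most recent error condition as a single line. */
    static String describe(ErrorQueries tet) {
        return "Error " + tet.get_errnum() + " in " + tet.get_apiname() + "(): " + tet.get_errmsg();
    }

    /** Returns null if the handle is usable, otherwise a description of the failure. */
    static String check(int handle, ErrorQueries tet) {
        return handle == FAILED ? describe(tet) : null;
    }

    public static void main(String[] args) {
        ErrorQueries tet = new FixedError(1234, "open_document", "example message");
        System.out.println(check(-1, tet));
    }
}
```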

open_page

public final int open_page(int doc,
                           int pagenumber,
                           java.lang.String optlist)
                    throws TETException
Open a page for text extraction.

Returns:
A handle for the page, or -1 in case of an error.
Throws:
TETException - TET output cannot be finished after an exception.

pcos_get_number

public final double pcos_get_number(int doc,
                                    java.lang.String path)
                             throws TETException
Get the value of a pCOS path with type number or boolean.

Specified by:
pcos_get_number in interface IpCOS
Returns:
The numerical value of the object identified by the pCOS path.
Throws:
TETException - TET output cannot be finished after an exception.

pcos_get_string

public final java.lang.String pcos_get_string(int doc,
                                              java.lang.String path)
                                       throws TETException
Get the value of a pCOS path with type name, number, string, or boolean.

Specified by:
pcos_get_string in interface IpCOS
Returns:
A string with the value of the object identified by the pCOS path.
Throws:
TETException - TET output cannot be finished after an exception.

pcos_get_stream

public final byte[] pcos_get_stream(int doc,
                                    java.lang.String optlist,
                                    java.lang.String path)
                             throws TETException
Get the contents of a pCOS path with type stream, fstream, or string.

Specified by:
pcos_get_stream in interface IpCOS
Returns:
The unencrypted data contained in the stream or string.
Throws:
TETException - TET output cannot be finished after an exception.

set_option

public final void set_option(java.lang.String optlist)
                      throws TETException
Set one or more global options for TET.

Specified by:
set_option in interface IpCOS
Throws:
TETException - TET output cannot be finished after an exception.

write_image_file

public final int write_image_file(int doc,
                                  int imageid,
                                  java.lang.String optlist)
                           throws TETException
Write image data to disk.

Returns:
-1 on error, or a value greater than 0 otherwise.
Throws:
TETException - TET output cannot be finished after an exception.

process_page

public final int process_page(int doc,
                              int pageno,
                              java.lang.String optlist)
                       throws TETException
Process a page and create TETML output.

Returns:
Always 1. PDF problems are reported in a TETML Exception element.
Throws:
TETException - TET output cannot be finished after an exception.

get_xml_data

public final byte[] get_xml_data(int doc,
                                 java.lang.String optlist)
                          throws TETException
Deprecated. Use TET_get_tetml().

Throws:
TETException - TET output cannot be finished after an exception.

get_tetml

public final byte[] get_tetml(int doc,
                              java.lang.String optlist)
                       throws TETException
Retrieve TETML data from memory.

Returns:
A byte array containing the next chunk of TETML data.
Throws:
TETException - TET output cannot be finished after an exception.

delete

public final void delete()
Delete a TET context and release all its internal resources. This should be called for cleanup when processing is done, and also after a TETException has occurred. The finalizer may call this method as well; issuing multiple calls is safe.

Specified by:
delete in interface IpCOS
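
The cleanup rule above (call delete() when processing is done and after an exception) maps naturally onto try/finally. The sketch below uses a hypothetical FakeContext in place of a real TET object, so only the control flow is shown; a plain RuntimeException stands in for TETException.

```java
final class CleanupDemo {
    /** Hypothetical stand-in for a TET context; only delete() is modeled. */
    static final class FakeContext {
        int releases = 0;
        private boolean deleted = false;

        void delete() {
            if (!deleted) { // repeated calls are harmless
                deleted = true;
                releases++;
            }
        }

        boolean isDeleted() { return deleted; }
    }

    /** Runs work and always deletes the context afterwards. Returns false if work failed. */
    static boolean runWithCleanup(FakeContext ctx, Runnable work) {
        try {
            work.run();
            return true;
        } catch (RuntimeException e) {
            return false; // real code would report the exception here
        } finally {
            ctx.delete(); // executed on success and on failure
        }
    }

    public static void main(String[] args) {
        FakeContext ctx = new FakeContext();
        boolean ok = runWithCleanup(ctx, () -> { throw new IllegalStateException("simulated failure"); });
        System.out.println(ok + " " + ctx.isDeleted());
    }
}
```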

get_char_info

public final int get_char_info(int page)
                        throws TETException
Get detailed information for the next character in the most recent text fragment.

Throws:
TETException - May throw an exception for various reasons.

get_color_info

public final int get_color_info(int doc,
                                int colorid,
                                java.lang.String keyword)
                         throws TETException
Get detailed information for a color id which has been retrieved with TET_get_char_info.

Throws:
TETException - May throw an exception for various reasons.

get_image_info

public final int get_image_info(int page)
                         throws TETException
Retrieve information about the next image on the page (but not the actual pixel data).

Throws:
TETException - May throw an exception for various reasons.

pcos_open_document

public int pcos_open_document(java.lang.String filename,
                              java.lang.String optlist)
                       throws java.lang.Exception
Open TET input document via the IpCOS interface.

Specified by:
pcos_open_document in interface IpCOS
Returns:
A document handle.
Throws:
java.lang.Exception

pcos_close_document

public void pcos_close_document(int doc,
                                java.lang.String optlist)
                         throws java.lang.Exception
Close TET input document via the IpCOS interface.

Specified by:
pcos_close_document in interface IpCOS
Throws:
java.lang.Exception