com.pdflib
Class TET

java.lang.Object
  extended by com.pdflib.TET

public final class TET
extends java.lang.Object

Text and Image Extraction Toolkit (TET): Toolkit for extracting Text, Images, and Metadata from PDF Documents.

Note that this is only a syntax summary. For complete information please refer to the TET API reference manual which is available in the "doc" directory of the TET distribution.

Version:
5.0
Author:
Rainer Schaaf

Field Summary
 double alpha
          Direction of inline text progression or direction of the pixel rows.
 int attributes
          Glyph attributes expressed as bits.
 double beta
          Text slanting angle or direction of pixel columns relative to the perpendicular of alpha.
 int colorid
          Unique text color id.
 int colorspaceid
          Colorspace id or -1.
 double[] components
          Color components.
 int fontid
          Index of the font in the fonts[] pseudo object.
 double fontsize
          Size of the font (always positive).
 double height
          Height of glyph or image.
 int imageid
          Index of the image in the pCOS pseudo object images[].
 int patternid
          Pattern id or -1.
 int textrendering
          Text rendering mode.
 int type
          Type of the character.
 boolean unknown
          Indicates whether the glyph could be mapped to Unicode.
 int uv
          UTF-32 Unicode value of the current character.
 double width
          Width of glyph or image.
 double x
          x position of the glyph's or image's reference point.
 double y
          y position of the glyph's or image's reference point.
 
Constructor Summary
TET()
          Create a new TET object.
 
Method Summary
 void close_document(int doc)
          Release a document handle and all internal resources related to that document.
 void close_page(int page)
          Release a page handle and all related resources.
 java.lang.String convert_to_unicode(java.lang.String inputformat, byte[] inputstring, java.lang.String optlist)
          Convert a string in an arbitrary encoding to a Unicode string in various formats.
 void create_pvf(java.lang.String filename, byte[] data, java.lang.String optlist)
          Create a named virtual read-only file from data provided in memory.
 int delete_pvf(java.lang.String filename)
          Delete a named virtual file and free its data structures (but not the contents).
 void delete()
          Delete a TET context and release all its internal resources.
 java.lang.String get_apiname()
          Get the name of the API function which caused an exception or failed.
 int get_char_info(int page)
          Get detailed information for the next character in the most recent text fragment.
 int get_color_info(int doc, int colorid, java.lang.String keyword)
          Get detailed information for a color id which has been retrieved with get_char_info().
 java.lang.String get_errmsg()
          Get the text of the last thrown exception or the reason for a failed function call.
 int get_errnum()
          Get the number of the last thrown exception or the reason for a failed function call.
 byte[] get_image_data(int doc, int imageid, java.lang.String optlist)
          Retrieve image data in memory.
 int get_image_info(int page)
          Retrieve information about the next image on the page (but not the actual pixel data).
 byte[] get_tetml(int doc, java.lang.String optlist)
          Retrieve TETML from memory.
 java.lang.String get_text(int page)
          Get the next text fragment from a page's content.
 byte[] get_xml_data(int doc, java.lang.String optlist)
          Deprecated. Use get_tetml() instead.
 double info_pvf(java.lang.String filename, java.lang.String keyword)
          Query properties of a virtual file or the PDFlib Virtual Filesystem (PVF).
 int open_document_mem(byte[] data, java.lang.String optlist)
          Deprecated. Use create_pvf() and open_document() instead.
 int open_document(java.lang.String filename, java.lang.String optlist)
          Open a disk-based or virtual PDF document for content extraction.
 int open_page(int doc, int pagenumber, java.lang.String optlist)
          Open a page for text extraction.
 double pcos_get_number(int doc, java.lang.String path)
          Get the value of a pCOS path with type number or boolean.
 byte[] pcos_get_stream(int doc, java.lang.String optlist, java.lang.String path)
          Get the contents of a pCOS path with type stream, fstream, or string.
 java.lang.String pcos_get_string(int doc, java.lang.String path)
          Get the value of a pCOS path with type name, number, string, or boolean.
 int process_page(int doc, int pageno, java.lang.String optlist)
          Process a page and create TETML output.
 void set_option(java.lang.String optlist)
          Set one or more global options for TET.
 int write_image_file(int doc, int imageid, java.lang.String optlist)
          Write image data to disk.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

uv

public int uv
UTF-32 Unicode value of the current character.

It will be 0 if the corresponding UTF-16 value is the trailing value of a surrogate pair (i.e. if type=11).
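
The relationship between a UTF-32 value and Java's UTF-16 strings can be illustrated with the standard `java.lang.Character` methods alone. The sketch below does not call TET; the class name `UvDemo` and the sample code points are illustrative assumptions. It shows why one character outside the Basic Multilingual Plane occupies two UTF-16 positions, the second of which is the trailing surrogate mentioned above.

```java
// Sketch only: plain Java, no TET classes involved.
final class UvDemo {
    /** Converts a UTF-32 code point (such as a uv value) to a UTF-16 Java string. */
    static String fromUv(int uv) {
        return new String(Character.toChars(uv));
    }

    /** True if the UTF-16 unit is the trailing (low) half of a surrogate pair. */
    static boolean isTrailingSurrogate(char unit) {
        return Character.isLowSurrogate(unit);
    }

    public static void main(String[] args) {
        String bmp = fromUv(0x41);      // LATIN CAPITAL LETTER A, one UTF-16 unit
        String supp = fromUv(0x1D11E);  // MUSICAL SYMBOL G CLEF, two UTF-16 units
        System.out.println(bmp.length() + " " + supp.length());   // prints "1 2"
        System.out.println(isTrailingSurrogate(supp.charAt(1)));  // prints "true"
    }
}
```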


type

public int type
Type of the character.

The following types describe real characters which correspond to a glyph on the page. The values of all other properties/fields are determined by the corresponding glyph:

The following types describe artificial characters which do not correspond to a glyph on the page. The x and y fields will specify the most recent real character's endpoint, the width field will be 0, and all other fields except uv will contain the values corresponding to the most recent real character:


unknown

public boolean unknown
Indicates whether the glyph could be mapped to Unicode.

Usually false, but will be true if the original glyph could not be mapped to Unicode and has been replaced with the character specified as unknownchar.


attributes

public int attributes
Glyph attributes expressed as bits.

The bits can be combined:
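
The individual bit meanings are not listed in this summary, so the following sketch only shows the generic way to test a combined bit field in Java. The class name `AttributeBits` and the bit positions used in `main` are placeholders, not documented TET values; consult the TET API reference for the actual assignments.

```java
// Sketch only: generic bit test, with placeholder bit positions.
final class AttributeBits {
    /** Returns true if the given bit (0 = least significant) is set in attributes. */
    static boolean isSet(int attributes, int bit) {
        return (attributes & (1 << bit)) != 0;
    }

    public static void main(String[] args) {
        // Placeholder value with bits 0 and 3 set; a real value would come from the attributes field.
        int attributes = (1 << 0) | (1 << 3);
        System.out.println(isSet(attributes, 0)); // prints "true"
        System.out.println(isSet(attributes, 1)); // prints "false"
        System.out.println(isSet(attributes, 3)); // prints "true"
    }
}
```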


x

public double x
x position of the glyph's or image's reference point.

x/y describe the position of the glyph's or image's reference point.

Text: The reference point is the lower left corner of the glyph box for horizontal writing mode, and the top center point for vertical writing mode. For artificial characters the x, y coordinates will be those of the end point of the most recent real character.

Images: The reference point is the lower left corner of the image.


y

public double y
y position of the glyph's or image's reference point.

See Also:
x

width

public double width
Width of glyph or image.

Text: Width of the corresponding glyph (for both horizontal and vertical writing mode). For artificial characters the width will be 0.

Images: Width of the image on the page in points, measured along the image's edges.


height

public double height
Height of glyph or image.

See Also:
width

alpha

public double alpha
Direction of inline text progression or direction of the pixel rows.

Text: Direction of inline text progression in degrees measured counter-clockwise. For horizontal writing mode this is the direction of the text baseline; for vertical writing mode it is the digression from the standard -90° direction. The angle will be in the range -180° < alpha <= +180°. For standard horizontal text as well as for standard text in vertical writing mode the angle will be 0°.

Images: Direction of the pixel rows. The angle will be in the range -180° < alpha <= +180°. For upright images alpha will be 0°.
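
As a rough illustration of how alpha combines with x, y, and width, the sketch below computes the point reached by moving width units from the reference point in the direction alpha. It assumes horizontal writing mode and that width is measured along the direction of alpha; both are assumptions to verify against the TET API reference. The class `BaselineGeometry` and its methods are hypothetical helpers, not part of TET.

```java
// Sketch only: plain trigonometry under the assumptions stated above.
final class BaselineGeometry {
    /** Point reached from (x, y) after moving width units in direction alphaDegrees (counter-clockwise). */
    static double[] endPoint(double x, double y, double width, double alphaDegrees) {
        double a = Math.toRadians(alphaDegrees);
        return new double[] { x + width * Math.cos(a), y + width * Math.sin(a) };
    }

    /** Maps any angle in degrees into the documented range -180 < angle <= +180. */
    static double normalize(double degrees) {
        double r = degrees % 360.0;        // result lies strictly between -360 and 360
        if (r > 180.0) {
            r -= 360.0;
        } else if (r <= -180.0) {
            r += 360.0;
        }
        return r;
    }

    public static void main(String[] args) {
        double[] p = endPoint(10.0, 20.0, 5.0, 0.0);
        System.out.println(p[0] + ", " + p[1]);
        System.out.println(normalize(270.0));
    }
}
```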


beta

public double beta
Text slanting angle or direction of pixel columns relative to the perpendicular of alpha.

Text: Text slanting angle in degrees (counter-clockwise), relative to the perpendicular of alpha. The angle will be 0° for upright text, and negative for italicized (slanted) text. The angle will be in the range -180° < beta <= 180°, but different from ±90°. If abs(beta) > 90° the text is mirrored at the baseline.

Images: Direction of the pixel columns, relative to the perpendicular of alpha. The angle will be in the range -180° < beta <= +180°, but different from ±90°. For upright images beta will be in the range -90° < beta < +90°. If abs(beta) > 90° the image will be mirrored at the baseline.
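
The rules for beta stated above (0° for upright, a nonzero angle for slanted text, abs(beta) > 90° for mirroring) can be turned into a small classifier. This is a sketch with a hypothetical class name `SlantDemo`; the exact comparison with 0 is a simplification, and real code may prefer a small tolerance.

```java
// Sketch only: classifies a beta value according to the rules quoted above.
final class SlantDemo {
    /** True if the text or image is mirrored at the baseline. */
    static boolean isMirrored(double beta) {
        return Math.abs(beta) > 90.0;
    }

    /** Returns "mirrored", "upright", or "slanted" for a beta value in degrees. */
    static String describe(double beta) {
        if (isMirrored(beta)) {
            return "mirrored";
        }
        if (beta == 0.0) {
            return "upright";
        }
        return "slanted";
    }

    public static void main(String[] args) {
        System.out.println(describe(0.0));    // prints "upright"
        System.out.println(describe(-12.0));  // prints "slanted"
        System.out.println(describe(170.0));  // prints "mirrored"
    }
}
```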


imageid

public int imageid
Index of the image in the pCOS pseudo object images[].

Detailed image properties can be retrieved via the entries in this pseudo object.


fontid

public int fontid
Index of the font in the fonts[] pseudo object.

fontid is never negative.
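
Because imageid and fontid are indexes into the images[] and fonts[] pseudo objects, a pCOS path for a specific entry can be assembled by string concatenation and then passed to pcos_get_number() or pcos_get_string(). The helper class `PcosPaths` below is hypothetical, and the key names `Width` and `name` are assumptions taken from general pCOS usage rather than from this summary; check the pCOS path reference for the keys that actually exist.

```java
// Sketch only: builds pCOS path strings; no TET call is made here.
final class PcosPaths {
    /** Path to a property of the image with the given index, e.g. images[3]/Width. */
    static String imagePath(int imageid, String key) {
        return "images[" + imageid + "]/" + key;
    }

    /** Path to a property of the font with the given index, e.g. fonts[0]/name. */
    static String fontPath(int fontid, String key) {
        return "fonts[" + fontid + "]/" + key;
    }

    public static void main(String[] args) {
        System.out.println(imagePath(3, "Width")); // prints "images[3]/Width"
        System.out.println(fontPath(0, "name"));   // prints "fonts[0]/name"
    }
}
```

With a real TET object `tet` and document handle `doc`, such a path would be used as `tet.pcos_get_number(doc, PcosPaths.imagePath(tet.imageid, "Width"))`.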


fontsize

public double fontsize
Size of the font (always positive).

The relation of this value to the actual height of glyphs is not fixed, but may vary with the font design. For most fonts the font size is chosen such that it encompasses all ascenders (including accented characters) and descenders.


textrendering

public int textrendering
Text rendering mode.

colorid

public int colorid
Unique text color id.


colorspaceid

public int colorspaceid
Colorspace id or -1.


patternid

public int patternid
Pattern id or -1.


components

public double[] components
Color components.

Constructor Detail

TET

public TET()
    throws TETException
Create a new TET object.

Throws:
TETException - May throw an exception in case of memory shortage.
Method Detail

close_document

public final void close_document(int doc)
                          throws TETException
Release a document handle and all internal resources related to that document.

Throws:
TETException - TET output cannot be finished after an exception.

close_page

public final void close_page(int page)
                      throws TETException
Release a page handle and all related resources.

Throws:
TETException - TET output cannot be finished after an exception.

create_pvf

public final void create_pvf(java.lang.String filename,
                             byte[] data,
                             java.lang.String optlist)
                      throws TETException
Create a named virtual read-only file from data provided in memory.

Throws:
TETException - TET output cannot be finished after an exception.

delete_pvf

public final int delete_pvf(java.lang.String filename)
                     throws TETException
Delete a named virtual file and free its data structures (but not the contents).

Throws:
TETException - TET output cannot be finished after an exception.

get_apiname

public final java.lang.String get_apiname()
Get the name of the API function which caused an exception or failed.

Throws:
TETException - TET output cannot be finished after an exception.

get_errmsg

public final java.lang.String get_errmsg()
Get the text of the last thrown exception or the reason for a failed function call.

Throws:
TETException - TET output cannot be finished after an exception.

get_errnum

public final int get_errnum()
Get the number of the last thrown exception or the reason for a failed function call.

Throws:
TETException - TET output cannot be finished after an exception.
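
The three query methods get_errnum(), get_apiname(), and get_errmsg() are typically combined into a single diagnostic line after a failed call. The sketch below shows one possible layout; the class `ErrorReport`, the message format, and the sample values are illustrative assumptions, not output produced by TET.

```java
// Sketch only: formats values that would come from get_errnum(), get_apiname(), get_errmsg().
final class ErrorReport {
    /** Combines error number, API function name, and message text into one line. */
    static String format(int errnum, String apiname, String errmsg) {
        return "Error " + errnum + " in " + apiname + "(): " + errmsg;
    }

    public static void main(String[] args) {
        // Made-up sample values; with a TET object this would be
        // format(tet.get_errnum(), tet.get_apiname(), tet.get_errmsg()).
        System.out.println(format(1234, "open_document", "sample message"));
    }
}
```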

get_image_data

public final byte[] get_image_data(int doc,
                                   int imageid,
                                   java.lang.String optlist)
                            throws TETException
Retrieve image data in memory.

Throws:
TETException - TET output cannot be finished after an exception.

get_text

public final java.lang.String get_text(int page)
                                throws TETException
Get the next text fragment from a page's content.

Throws:
TETException - TET output cannot be finished after an exception.
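
Since get_text() returns the next fragment on each call, page text is usually collected in a loop between open_page() and close_page(). The sketch below runs against a hypothetical interface `PageTextSource` that copies three signatures from this class (without the checked TETException) and a toy in-memory stand-in `FakeSource`, so it does not need the TET library. It assumes that open_page() returns -1 on failure and that get_text() returns null when no more text is available; neither convention is stated in this summary, so check the TET API reference.

```java
// Sketch only: PageTextSource and FakeSource are hypothetical stand-ins, not TET classes.
interface PageTextSource {
    int open_page(int doc, int pagenumber, String optlist);
    String get_text(int page);
    void close_page(int page);
}

/** Toy stand-in that serves fixed fragments for page 1 and fails for other pages. */
final class FakeSource implements PageTextSource {
    private final String[] fragments;
    private int next = 0;
    boolean closed = false;

    FakeSource(String[] fragments) {
        this.fragments = fragments;
    }

    public int open_page(int doc, int pagenumber, String optlist) {
        return pagenumber == 1 ? 0 : -1;   // assumed convention: -1 signals failure
    }

    public String get_text(int page) {
        return next < fragments.length ? fragments[next++] : null; // assumed: null at end
    }

    public void close_page(int page) {
        closed = true;
    }
}

final class ExtractLoop {
    /** Concatenates all fragments of one page, or returns null if the page cannot be opened. */
    static String extractPage(PageTextSource tet, int doc, int pageno, String optlist) {
        int page = tet.open_page(doc, pageno, optlist);
        if (page == -1) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        try {
            String text;
            while ((text = tet.get_text(page)) != null) {
                sb.append(text);
            }
        } finally {
            tet.close_page(page);          // release the page handle even if reading fails
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        FakeSource source = new FakeSource(new String[] { "Hello ", "world" });
        System.out.println(extractPage(source, 0, 1, "")); // prints "Hello world"
    }
}
```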

info_pvf

public final double info_pvf(java.lang.String filename,
                             java.lang.String keyword)
                      throws TETException
Query properties of a virtual file or the PDFlib Virtual Filesystem (PVF).

Returns:
The value of some file parameter as requested by keyword.
Throws:
TETException - TET output cannot be finished after an exception.

open_document

public final int open_document(java.lang.String filename,
                               java.lang.String optlist)
                        throws TETException
Open a disk-based or virtual PDF document for content extraction.

Throws:
TETException - TET output cannot be finished after an exception.

open_document_mem

public final int open_document_mem(byte[] data,
                                   java.lang.String optlist)
                            throws TETException
Deprecated. Use create_pvf() and open_document() instead.

Throws:
TETException - TET output cannot be finished after an exception.

open_page

public final int open_page(int doc,
                           int pagenumber,
                           java.lang.String optlist)
                    throws TETException
Open a page for text extraction.

Throws:
TETException - TET output cannot be finished after an exception.

pcos_get_number

public final double pcos_get_number(int doc,
                                    java.lang.String path)
                             throws TETException
Get the value of a pCOS path with type number or boolean.

Throws:
TETException - TET output cannot be finished after an exception.

pcos_get_string

public final java.lang.String pcos_get_string(int doc,
                                              java.lang.String path)
                                       throws TETException
Get the value of a pCOS path with type name, number, string, or boolean.

Throws:
TETException - TET output cannot be finished after an exception.

pcos_get_stream

public final byte[] pcos_get_stream(int doc,
                                    java.lang.String optlist,
                                    java.lang.String path)
                             throws TETException
Get the contents of a pCOS path with type stream, fstream, or string.

Throws:
TETException - TET output cannot be finished after an exception.

set_option

public final void set_option(java.lang.String optlist)
                      throws TETException
Set one or more global options for TET.

Throws:
TETException - TET output cannot be finished after an exception.

convert_to_unicode

public final java.lang.String convert_to_unicode(java.lang.String inputformat,
                                                 byte[] inputstring,
                                                 java.lang.String optlist)
                                          throws TETException
Convert a string in an arbitrary encoding to a Unicode string in various formats.

Returns:
The converted Unicode string.
Throws:
TETException - TET output cannot be finished after an exception.

write_image_file

public final int write_image_file(int doc,
                                  int imageid,
                                  java.lang.String optlist)
                           throws TETException
Write image data to disk.

Throws:
TETException - TET output cannot be finished after an exception.

process_page

public final int process_page(int doc,
                              int pageno,
                              java.lang.String optlist)
                       throws TETException
Process a page and create TETML output.

Throws:
TETException - TET output cannot be finished after an exception.

get_xml_data

public final byte[] get_xml_data(int doc,
                                 java.lang.String optlist)
                          throws TETException
Deprecated. Use get_tetml() instead.

Throws:
TETException - TET output cannot be finished after an exception.

get_tetml

public final byte[] get_tetml(int doc,
                              java.lang.String optlist)
                       throws TETException
Retrieve TETML from memory.

Throws:
TETException - TET output cannot be finished after an exception.

delete

public final void delete()
Delete a TET context and release all its internal resources. This should be called for cleanup when processing is done, and after a TETException occurred. This method may also be called by the finalizer, but it is safe to issue multiple calls.
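
The cleanup rule above maps naturally onto a try/finally block: delete() runs whether processing succeeds or throws, and a further call is harmless. The sketch below demonstrates only the control flow, using a hypothetical stand-in `FakeContext` and an unchecked exception in place of a TET object and TETException.

```java
// Sketch only: FakeContext is a hypothetical stand-in for a TET object.
final class FakeContext {
    int deleteCalls = 0;
    boolean released = false;

    /** Simulates processing that may fail. */
    void work(boolean fail) {
        if (fail) {
            throw new IllegalStateException("simulated failure");
        }
    }

    /** Idempotent cleanup, mirroring the statement that multiple calls are safe. */
    void delete() {
        deleteCalls++;
        released = true;
    }
}

final class CleanupDemo {
    /** Returns true if processing succeeded; always releases the context. */
    static boolean run(FakeContext ctx, boolean fail) {
        try {
            ctx.work(fail);
            return true;
        } catch (IllegalStateException e) {
            return false;
        } finally {
            ctx.delete();   // runs after success and after an exception alike
        }
    }

    public static void main(String[] args) {
        FakeContext ctx = new FakeContext();
        System.out.println(run(ctx, true));  // prints "false"
        System.out.println(ctx.released);    // prints "true"
    }
}
```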


get_char_info

public final int get_char_info(int page)
                        throws TETException
Get detailed information for the next character in the most recent text fragment.

Throws:
TETException - May throw an exception for various reasons.

get_color_info

public final int get_color_info(int doc,
                                int colorid,
                                java.lang.String keyword)
                         throws TETException
Get detailed information for a color id which has been retrieved with get_char_info().

Throws:
TETException - May throw an exception for various reasons.

get_image_info

public final int get_image_info(int page)
                         throws TETException
Retrieve information about the next image on the page (but not the actual pixel data).

Throws:
TETException - May throw an exception for various reasons.