com.pdflib
Class TET

java.lang.Object
  extended by com.pdflib.TET

public final class TET
extends java.lang.Object

Text Extraction Toolkit (TET): Toolkit for extracting Text, Images, and Metadata from PDF Documents.

Note that this is only a syntax summary. For complete information please refer to the TET API reference manual which is available in the "doc" directory of the TET distribution.

Version:
4.2p1
Author:
Rainer Schaaf

Field Summary
 double alpha
          Direction of inline text progression or direction of the pixel rows.
 int attributes
          Glyph attributes expressed as bits.
 double beta
          Text slanting angle or direction of pixel columns relative to the perpendicular of alpha.
 int fontid
          Index of the font in the fonts[] pseudo object.
 double fontsize
          Size of the font (always positive).
 double height
          Height of glyph or image.
 int imageid
          Index of the image in the pCOS pseudo object images[].
 int textrendering
          Text rendering mode.
 int type
          Type of the character.
 boolean unknown
          Indicates whether the glyph could be mapped to Unicode.
 int uv
          UTF-32 Unicode value of the current character.
 double width
          Width of glyph or image.
 double x
          x position of the glyph's or image's reference point.
 double y
          y position of the glyph's or image's reference point.
 
Constructor Summary
TET()
          Create a new TET object.
 
Method Summary
 void close_document(int doc)
          Release a document handle and all internal resources related to that document.
 void close_page(int page)
          Release a page handle and all related resources.
 java.lang.String convert_to_unicode(java.lang.String inputformat, byte[] input, java.lang.String optlist)
          Convert a string in an arbitrary encoding to a Unicode string in various formats.
 void create_pvf(java.lang.String filename, byte[] data, java.lang.String optlist)
          Create a named virtual read-only file from data provided in memory.
 int delete_pvf(java.lang.String filename)
          Delete a named virtual file and free its data structures (but not the contents).
 void delete()
          Delete a TET context and release all its internal resources.
 java.lang.String get_apiname()
          Get the name of the API function which caused an exception or failed.
 int get_char_info(int page)
          Get detailed information for the next character in the most recent text fragment.
 java.lang.String get_errmsg()
          Get the text of the last thrown exception or the reason for a failed function call.
 int get_errnum()
          Get the number of the last thrown exception or the reason for a failed function call.
 byte[] get_image_data(int doc, int imageid, java.lang.String optlist)
          Retrieve image data in memory.
 int get_image_info(int page)
          Retrieve information about the next image on the page (but not the actual pixel data).
 java.lang.String get_text(int page)
          Get the next text fragment from a page's content.
 byte[] get_xml_data(int doc, java.lang.String optlist)
          Retrieve TETML data from memory.
 double info_pvf(java.lang.String filename, java.lang.String keyword)
          Query properties of a virtual file or the PDFlib Virtual Filesystem (PVF).
 int open_document_mem(byte[] data, java.lang.String optlist)
          Deprecated. Use create_pvf() and open_document() instead.
 int open_document(java.lang.String filename, java.lang.String optlist)
          Open a PDF document from file for text extraction.
 int open_page(int doc, int pageno, java.lang.String optlist)
          Open a page for text extraction.
 double pcos_get_number(int doc, java.lang.String path)
          Get the value of a pCOS path with type number or boolean.
 byte[] pcos_get_stream(int doc, java.lang.String optlist, java.lang.String path)
          Get the contents of a pCOS path with type stream or fstream.
 java.lang.String pcos_get_string(int doc, java.lang.String path)
          Get the value of a pCOS path with type name, string or boolean.
 int process_page(int doc, int pageno, java.lang.String optlist)
          Process a page and create TETML output.
 void set_option(java.lang.String optlist)
          Set one or more global options for TET.
 int write_image_file(int doc, int imageid, java.lang.String optlist)
          Write image data to disk.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

uv

public int uv
UTF-32 Unicode value of the current character.

It will be 0 if the corresponding UTF-16 value is the trailing value of a surrogate pair (i.e. if type=11).
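
As a minimal sketch of how a uv value might be turned into Java text, the helper below skips the value 0 reported for a trailing surrogate and lets Character.toChars build the UTF-16 form. The class and method names are hypothetical and are not part of the TET API. The sketch assumes that the entry preceding a trailing surrogate already carried the complete UTF-32 value.

```java
// Hypothetical helper, not part of com.pdflib.TET.
final class UvText {
    /** Returns the UTF-16 text for a UTF-32 value, or "" when uv is 0. */
    static String fromUv(int uv) {
        if (uv == 0) {
            // Trailing surrogate entry: assumed to be covered by the
            // preceding entry, which holds the full code point.
            return "";
        }
        // Character.toChars yields one char for BMP values and a
        // surrogate pair for supplementary code points.
        return new String(Character.toChars(uv));
    }
}
```

Appending the result of UvText.fromUv(tet.uv) for each character would then rebuild the text without duplicating supplementary characters.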


type

public int type
Type of the character.

The following types describe real characters which correspond to a glyph on the page. The values of all other properties/fields are determined by the corresponding glyph:

The following types describe artificial characters which do not correspond to a glyph on the page. The x and y fields will specify the most recent real character's endpoint, the width field will be 0, and all other fields except uv will contain the values corresponding to the most recent real character:


unknown

public boolean unknown
Indicates whether the glyph could be mapped to Unicode.

Usually false, but will be true if the original glyph could not be mapped to Unicode and has been replaced with the character specified as unknownchar.


attributes

public int attributes
Glyph attributes expressed as bits.

The bits can be combined:


x

public double x
x position of the glyph's or image's reference point.

x/y describe the position of the glyph's or image's reference point.

Text: The reference point is the lower left corner of the glyph box for horizontal writing mode, and the top center point for vertical writing mode. For artificial characters the x, y coordinates will be those of the end point of the most recent real character.

Images: The reference point is the lower left corner of the image.


y

public double y
y position of the glyph's or image's reference point.

See Also:
x

width

public double width
Width of glyph or image.

Text: Width of the corresponding glyph (for both horizontal and vertical writing mode). For artificial characters the width will be 0.

Images: Width of the image on the page in points, measured along the image's edges.


height

public double height
Height of glyph or image.

See Also:
width

alpha

public double alpha
Direction of inline text progression or direction of the pixel rows.

Text: Direction of inline text progression in degrees measured counter-clockwise. For horizontal writing mode this is the direction of the text baseline; for vertical writing mode it is the digression from the standard -90° direction. The angle will be in the range -180° < alpha <= +180°. For standard horizontal text as well as for standard text in vertical writing mode the angle will be 0°.

Images: Direction of the pixel rows. The angle will be in the range -180° < alpha <= +180°. For upright images alpha will be 0°.
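
The relation between x, y, width and alpha can be sketched for horizontal writing mode as follows. The helper is hypothetical and assumes that a glyph's end point lies at distance width from the reference point in the direction alpha. The normalize method only shows how an arbitrary angle maps into the documented range -180° < alpha <= +180°.

```java
// Hypothetical helper, not part of com.pdflib.TET.
final class GlyphGeometry {
    /** End point {x, y} of a glyph in horizontal writing mode (assumed geometry). */
    static double[] endPoint(double x, double y, double width, double alphaDegrees) {
        double a = Math.toRadians(alphaDegrees); // alpha is measured counter-clockwise
        return new double[] { x + width * Math.cos(a), y + width * Math.sin(a) };
    }

    /** Maps an angle in degrees into the range -180 < angle <= 180. */
    static double normalize(double degrees) {
        double r = degrees % 360.0;   // Java remainder keeps the sign: -360 < r < 360
        if (r > 180.0) {
            r -= 360.0;
        } else if (r <= -180.0) {
            r += 360.0;
        }
        return r;
    }
}
```

For standard horizontal text (alpha = 0) the end point is simply (x + width, y).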


beta

public double beta
Text slanting angle or direction of pixel columns relative to the perpendicular of alpha.

Text: Text slanting angle in degrees (counter-clockwise), relative to the perpendicular of alpha. The angle will be 0° for upright text, and negative for italicized (slanted) text. The angle will be in the range -180° < beta <= 180°, but different from ±90°. If abs(beta) > 90° the text is mirrored at the baseline.

Images: Direction of the pixel columns, relative to the perpendicular of alpha. The angle will be in the range -180° < beta <= +180°, but different from ±90°. For upright images beta will be in the range -90° < beta < +90°. If abs(beta) > 90° the image will be mirrored at the baseline.
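
The rules for beta stated above could be applied as in the following sketch. The class, the method names and the tolerance parameter are hypothetical. The slant check is restricted to non-mirrored text because the description above does not say how slanting is expressed for mirrored text.

```java
// Hypothetical helper, not part of com.pdflib.TET.
final class BetaInfo {
    /** True if abs(beta) > 90 degrees, i.e. the glyph or image is mirrored. */
    static boolean isMirrored(double betaDegrees) {
        return Math.abs(betaDegrees) > 90.0;
    }

    /**
     * True for non-mirrored text whose beta is negative by more than the
     * given tolerance, which the description above associates with
     * italicized (slanted) text.
     */
    static boolean isSlanted(double betaDegrees, double toleranceDegrees) {
        return !isMirrored(betaDegrees) && betaDegrees < -toleranceDegrees;
    }
}
```

A small tolerance such as 0.5 degrees avoids treating rounding noise around 0° as a slant.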


imageid

public int imageid
Index of the image in the pCOS pseudo object images[].

Detailed image properties can be retrieved via the entries in this pseudo object.
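
To illustrate how imageid could be combined with the pCOS methods, the sketch below only builds the path string. The caller is expected to pass it to pcos_get_number() or pcos_get_string(). The helper is hypothetical, and key names such as Width, Height or bpc are assumptions that should be checked against the pCOS path reference.

```java
// Hypothetical helper, not part of com.pdflib.TET.
final class ImagePaths {
    /** Builds a pCOS path for one entry of the images[] pseudo object. */
    static String path(int imageid, String key) {
        if (imageid < 0) {
            throw new IllegalArgumentException("imageid must not be negative");
        }
        return "images[" + imageid + "]/" + key;
    }
}
```

A call such as tet.pcos_get_number(doc, ImagePaths.path(tet.imageid, "Width")) would then query a single image property.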


fontid

public int fontid
Index of the font in the fonts[] pseudo object.

fontid is never negative.


fontsize

public double fontsize
Size of the font (always positive).

The relation of this value to the actual height of glyphs is not fixed, but may vary with the font design. For most fonts the font size is chosen such that it encompasses all ascenders (including accented characters) and descenders.


textrendering

public int textrendering
Text rendering mode.

Constructor Detail

TET

public TET()
    throws TETException
Create a new TET object.

Throws:
TETException - May throw an exception in case of memory shortage.

Method Detail

close_document

public final void close_document(int doc)
                          throws TETException
Release a document handle and all internal resources related to that document.

Throws:
TETException - May throw an exception for various reasons.

close_page

public final void close_page(int page)
                      throws TETException
Release a page handle and all related resources.

Throws:
TETException - May throw an exception for various reasons.

create_pvf

public final void create_pvf(java.lang.String filename,
                             byte[] data,
                             java.lang.String optlist)
                      throws TETException
Create a named virtual read-only file from data provided in memory.

Throws:
TETException - May throw an exception for various reasons.

delete

public final void delete()
Delete a TET context and release all its internal resources. This should be called for cleanup when processing is done, and after a TETException occurred. This method may also be called by the finalizer, but it is safe to issue multiple calls.


delete_pvf

public final int delete_pvf(java.lang.String filename)
                     throws TETException
Delete a named virtual file and free its data structures (but not the contents).

Returns:
-1 if the virtual file exists but is locked, and 1 otherwise.
Throws:
TETException - May throw an exception for various reasons.

get_apiname

public final java.lang.String get_apiname()
Get the name of the API function which caused an exception or failed.

Returns:
TETlib API function name

get_errmsg

public final java.lang.String get_errmsg()
Get the text of the last thrown exception or the reason for a failed function call.

Returns:
TETlib error message

get_errnum

public final int get_errnum()
Get the number of the last thrown exception or the reason for a failed function call.

Returns:
TETlib error number

get_char_info

public final int get_char_info(int page)
                        throws TETException
Get detailed information for the next character in the most recent text fragment.

Throws:
TETException - May throw an exception for various reasons.

get_image_data

public final byte[] get_image_data(int doc,
                                   int imageid,
                                   java.lang.String optlist)
                            throws TETException
Retrieve image data in memory.

Returns:
The data representing the image.
Throws:
TETException - May throw an exception for various reasons.

get_image_info

public final int get_image_info(int page)
                         throws TETException
Retrieve information about the next image on the page (but not the actual pixel data).

Throws:
TETException - May throw an exception for various reasons.

get_text

public final java.lang.String get_text(int page)
                                throws TETException
Get the next text fragment from a page's content.

Returns:
A string containing the next text fragment from the page.
Throws:
TETException - May throw an exception for various reasons.
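
A typical call sequence around get_text might look like the sketch below. So that it compiles without the TET library, it is written against a hypothetical interface TextApi that mirrors a few of the signatures documented on this page but omits the TETException declarations. The sketch assumes that get_text returns null once no further fragments are available and that an empty option list is acceptable.

```java
// Hypothetical stand-in mirroring some documented TET signatures
// (without the checked TETException), so the sketch is self-contained.
interface TextApi {
    int open_document(String filename, String optlist);
    int open_page(int doc, int pageno, String optlist);
    String get_text(int page);
    void close_page(int page);
    void close_document(int doc);
}

final class ExtractSketch {
    /** Collects all text fragments of one page into a single string. */
    static String extractPage(TextApi tet, String filename, int pageno) {
        int doc = tet.open_document(filename, "");
        if (doc == -1) {
            throw new IllegalStateException("cannot open document " + filename);
        }
        try {
            int page = tet.open_page(doc, pageno, "");
            if (page == -1) {
                throw new IllegalStateException("cannot open page " + pageno);
            }
            try {
                StringBuilder sb = new StringBuilder();
                String fragment;
                // Assumption: null signals that all fragments were delivered.
                while ((fragment = tet.get_text(page)) != null) {
                    sb.append(fragment);
                }
                return sb.toString();
            } finally {
                tet.close_page(page);      // release the page handle
            }
        } finally {
            tet.close_document(doc);       // release the document handle
        }
    }
}
```

With the real class the same structure applies, with TETException handled by the caller and delete() called once processing is done.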

info_pvf

public final double info_pvf(java.lang.String filename,
                             java.lang.String keyword)
                      throws TETException
Query properties of a virtual file or the PDFlib Virtual Filesystem (PVF).

Returns:
The value of some file parameter as requested by keyword.
Throws:
TETException - May throw an exception for various reasons.

open_document

public final int open_document(java.lang.String filename,
                               java.lang.String optlist)
                        throws TETException
Open a PDF document from file for text extraction.

Returns:
-1 on error, or a document handle otherwise
Throws:
TETException - May throw an exception for various reasons.

open_document_mem

public final int open_document_mem(byte[] data,
                                   java.lang.String optlist)
                            throws TETException
Deprecated. Use create_pvf() and open_document() instead.

Throws:
TETException - May throw an exception for various reasons.
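
The replacement for open_document_mem, that is create_pvf() followed by open_document(), could be sketched as follows. The interface PvfApi is a hypothetical stand-in that mirrors the documented signatures without the TETException declarations, and the virtual file name is an arbitrary example.

```java
// Hypothetical stand-in mirroring some documented TET signatures
// (without the checked TETException), so the sketch is self-contained.
interface PvfApi {
    void create_pvf(String filename, byte[] data, String optlist);
    int open_document(String filename, String optlist);
    int delete_pvf(String filename);
}

final class MemoryOpenSketch {
    // Arbitrary example name for the virtual file.
    static final String VIRTUAL_NAME = "/pvf/input.pdf";

    /** Opens a PDF held in memory; returns the document handle or -1. */
    static int openFromMemory(PvfApi tet, byte[] pdfBytes) {
        // Make the in-memory data available under a virtual file name.
        tet.create_pvf(VIRTUAL_NAME, pdfBytes, "");
        int doc = tet.open_document(VIRTUAL_NAME, "");
        if (doc == -1) {
            // Nothing was opened, so drop the virtual file again.
            tet.delete_pvf(VIRTUAL_NAME);
        }
        return doc;
    }
}
```

After a successful open, the virtual file would be deleted with delete_pvf() only once the document has been closed.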

open_page

public final int open_page(int doc,
                           int pageno,
                           java.lang.String optlist)
                    throws TETException
Open a page for text extraction.

Returns:
-1 on error, or a page handle otherwise
Throws:
TETException - May throw an exception for various reasons.

pcos_get_number

public final double pcos_get_number(int doc,
                                    java.lang.String path)
                             throws TETException
Get the value of a pCOS path with type number or boolean.

Returns:
The numerical value of the parameter.
Throws:
TETException - May throw an exception for various reasons.

pcos_get_string

public final java.lang.String pcos_get_string(int doc,
                                              java.lang.String path)
                                       throws TETException
Get the value of a pCOS path with type name, string or boolean.

Returns:
The parameter's string value
Throws:
TETException - May throw an exception for various reasons.

pcos_get_stream

public final byte[] pcos_get_stream(int doc,
                                    java.lang.String optlist,
                                    java.lang.String path)
                             throws TETException
Get the contents of a pCOS path with type stream or fstream.

Returns:
The data contained in the stream.
Throws:
TETException - May throw an exception for various reasons.

set_option

public final void set_option(java.lang.String optlist)
                      throws TETException
Set one or more global options for TET.

Throws:
TETException - May throw an exception for various reasons.

write_image_file

public final int write_image_file(int doc,
                                  int imageid,
                                  java.lang.String optlist)
                           throws TETException
Write image data to disk.

Returns:
-1 on error, and 1 otherwise
Throws:
TETException - May throw an exception for various reasons.

process_page

public final int process_page(int doc,
                              int pageno,
                              java.lang.String optlist)
                       throws TETException
Process a page and create TETML output.

Returns:
-1 on error, and 1 otherwise
Throws:
TETException - May throw an exception for various reasons.

get_xml_data

public final byte[] get_xml_data(int doc,
                                 java.lang.String optlist)
                          throws TETException
Retrieve TETML data from memory.

Returns:
A byte array containing the next chunk of data.
Throws:
TETException - May throw an exception for various reasons.

convert_to_unicode

public final java.lang.String convert_to_unicode(java.lang.String inputformat,
                                                 byte[] input,
                                                 java.lang.String optlist)
                                          throws TETException
Convert a string in an arbitrary encoding to a Unicode string in various formats.

Returns:
The converted Unicode string.
Throws:
TETException - May throw an exception for various reasons.