java.lang.Object
  com.pdflib.TET
public final class TET
Text and Image Extraction Toolkit (TET): Toolkit for extracting Text, Images, and Metadata from PDF Documents.
Note that this is only a syntax summary. For complete information please refer to the TET API reference manual which is available in the "doc" directory of the TET distribution.
Field Summary | |
---|---|
double | alpha: Direction of inline text progression or direction of the pixel rows. |
int | attributes: Glyph attributes expressed as bits. |
double | beta: Text slanting angle or direction of pixel columns relative to the perpendicular of alpha. |
int | colorid: Unique text color id. |
int | colorspaceid: Colorspace id or -1. |
double[] | components: Color components. |
int | fontid: Index of the font in the fonts[] pseudo object. |
double | fontsize: Size of the font (always positive). |
double | height: Height of glyph or image. |
int | imageid: Index of the image in the pCOS pseudo object images[]. |
int | patternid: Pattern id or -1. |
int | textrendering: Text rendering mode. |
int | type: Type of the character. |
boolean | unknown: Indicates whether the glyph could be mapped to Unicode. |
int | uv: UTF-32 Unicode value of the current character. |
double | width: Width of glyph or image. |
double | x: x position of the glyph's or image's reference point. |
double | y: y position of the glyph's or image's reference point. |
Constructor Summary | |
---|---|
TET(): Create a new TET object. |
Method Summary | |
---|---|
void | close_document(int doc): Release a document handle and all internal resources related to that document. |
void | close_page(int page): Release a page handle and all related resources. |
java.lang.String | convert_to_unicode(java.lang.String inputformat, byte[] inputstring, java.lang.String optlist): Convert a string in an arbitrary encoding to a Unicode string in various formats. |
void | create_pvf(java.lang.String filename, byte[] data, java.lang.String optlist): Create a named virtual read-only file from data provided in memory. |
int | delete_pvf(java.lang.String filename): Delete a named virtual file and free its data structures. |
void | delete(): Delete a TET context and release all its internal resources. |
java.lang.String | get_apiname(): Get the name of the API function which caused an exception or failed. |
int | get_char_info(int page): Get detailed information for the next character in the most recent text fragment. |
int | get_color_info(int doc, int colorid, java.lang.String keyword): Get detailed information for a color id which has been retrieved with get_char_info. |
java.lang.String | get_errmsg(): Get the text of the last thrown exception or the reason for a failed function call. |
int | get_errnum(): Get the number of the last thrown exception or the reason for a failed function call. |
byte[] | get_image_data(int doc, int imageid, java.lang.String optlist): Write image data to memory. |
int | get_image_info(int page): Retrieve information about the next image on the page (but not the actual pixel data). |
byte[] | get_tetml(int doc, java.lang.String optlist): Retrieve TETML data from memory. |
java.lang.String | get_text(int page): Get the next text fragment from a page's content. |
byte[] | get_xml_data(int doc, java.lang.String optlist): Deprecated. Use get_tetml(). |
double | info_pvf(java.lang.String filename, java.lang.String keyword): Query properties of a virtual file or the PDFlib Virtual Filesystem (PVF). |
int | open_document(java.lang.String filename, java.lang.String optlist): Open a disk-based or virtual PDF document for content extraction. |
int | open_page(int doc, int pagenumber, java.lang.String optlist): Open a page for text extraction. |
void | pcos_close_document(int doc, java.lang.String optlist): Close TET input document via the IpCOS interface. |
double | pcos_get_number(int doc, java.lang.String path): Get the value of a pCOS path with type number or boolean. |
byte[] | pcos_get_stream(int doc, java.lang.String optlist, java.lang.String path): Get the contents of a pCOS path with type stream, fstream, or string. |
java.lang.String | pcos_get_string(int doc, java.lang.String path): Get the value of a pCOS path with type name, number, string, or boolean. |
int | pcos_open_document(java.lang.String filename, java.lang.String optlist): Open TET input document via the IpCOS interface. |
int | process_page(int doc, int pageno, java.lang.String optlist): Process a page and create TETML output. |
void | set_option(java.lang.String optlist): Set one or more global options for TET. |
int | write_image_file(int doc, int imageid, java.lang.String optlist): Write image data to disk. |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public int uv
UTF-32 Unicode value of the current character. It will be 0 if the corresponding UTF-16 value is the trailing value of a surrogate pair (i.e. if type=11).
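The following is a minimal sketch of how a sequence of uv values could be turned back into a Java string. It does not call TET itself: the class name UvDecode, the method appendUv, and the sample values are hypothetical, and it assumes that the entry preceding a 0 placeholder already carries the full UTF-32 value of the surrogate pair.

```java
class UvDecode {
    /**
     * Appends the character for one uv value to sb.
     * A value of 0 marks the trailing half of a surrogate pair and is skipped,
     * on the assumption that the preceding uv held the complete code point.
     */
    public static void appendUv(StringBuilder sb, int uv) {
        if (uv == 0) {
            return;
        }
        // appendCodePoint emits one or two UTF-16 code units as needed.
        sb.appendCodePoint(uv);
    }

    public static void main(String[] args) {
        // Hypothetical uv sequence: 'H', U+1D11E (outside the BMP), its 0 placeholder, 'i'.
        int[] uvs = {0x48, 0x1D11E, 0, 0x69};
        StringBuilder sb = new StringBuilder();
        for (int uv : uvs) {
            appendUv(sb, uv);
        }
        String s = sb.toString();
        System.out.println(s.length());                      // prints 4 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // prints 3 (code points)
    }
}
```

Skipping the placeholder keeps the number of UTF-16 code units in the result equal to the number of uv entries processed, which is convenient when indexing back into the text fragment.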
public int type
The following types describe real characters which correspond to a glyph on the page. The values of all other properties/fields are determined by the corresponding glyph:
public boolean unknown
Usually false, but will be true if the original glyph could not be mapped to Unicode and has been replaced with the character specified as unknownchar.
public int attributes
The bits can be combined:
public double x
x/y describe the position of the glyph's or image's reference point.
Text: The reference point is the lower left corner of the glyph box for horizontal writing mode, and the top center point for vertical writing mode. For artificial characters the x, y coordinates will be those of the end point of the most recent real character.
Images: The reference point is the lower left corner of the image.
public double y
See also: x
public double width
Text: Width of the corresponding glyph (for both horizontal and vertical writing mode). For artificial characters the width will be 0.
Images: Width of the image on the page in points, measured along the image's edges.
public double height
Height of glyph or image.
See also: width
public double alpha
Text: Direction of inline text progression in degrees measured counter-clockwise. For horizontal writing mode this is the direction of the text baseline; for vertical writing mode it is the deviation from the standard -90° direction. The angle will be in the range -180° < alpha <= +180°. For standard horizontal text as well as for standard text in vertical writing mode the angle will be 0°.
Images: Direction of the pixel rows. The angle will be in the range -180° < alpha <= +180°. For upright images alpha will be 0°.
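As an illustration of the angle conventions described for alpha, the sketch below converts an alpha value into a unit direction vector and normalizes arbitrary angles into the documented range. It is not part of the TET API: the class AlphaDirection and its methods are hypothetical, and it assumes a coordinate system with the y axis pointing up in which 0° corresponds to the positive x axis.

```java
class AlphaDirection {
    /** Normalizes an angle in degrees to the range -180 < a <= 180. */
    public static double normalize(double degrees) {
        // Java's % keeps the sign of the dividend, so a lies in (-360, 360).
        double a = degrees % 360.0;
        if (a > 180.0) {
            a -= 360.0;
        } else if (a <= -180.0) {
            a += 360.0;
        }
        return a;
    }

    /**
     * Unit vector {dx, dy} of the text progression.
     * For vertical writing mode alpha is relative to the standard -90 degree
     * direction, so 90 degrees are subtracted first.
     */
    public static double[] direction(double alphaDegrees, boolean verticalWritingMode) {
        double angle = verticalWritingMode ? alphaDegrees - 90.0 : alphaDegrees;
        double r = Math.toRadians(angle);
        return new double[] {Math.cos(r), Math.sin(r)};
    }

    public static void main(String[] args) {
        double[] d = direction(0.0, false);
        System.out.printf("%.1f %.1f%n", d[0], d[1]); // standard horizontal text runs along +x
    }
}
```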
public double beta
Text: Text slanting angle in degrees (counter-clockwise), relative to the perpendicular of alpha. The angle will be 0° for upright text, and negative for italicized (slanted) text. The angle will be in the range -180° < beta <= 180°, but different from ±90°. If abs(beta) > 90° the text is mirrored at the baseline.
Images: Direction of the pixel columns, relative to the perpendicular of alpha. The angle will be in the range -180° < beta <= +180°, but different from ±90°. For upright images beta will be in the range -90° < beta < +90°. If abs(beta) > 90° the image will be mirrored at the baseline.
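The alpha and beta descriptions for images can be combined into a rough outline of the placed image. The sketch below is an illustration only and does not call TET: the class ImageQuad and its methods are hypothetical, the y axis is assumed to point up, and the height is assumed to be measured along the pixel-column edge in the same way the width is measured along the pixel-row edge.

```java
class ImageQuad {
    /** True if abs(beta) > 90, i.e. the text or image is mirrored at the baseline. */
    public static boolean isMirrored(double betaDegrees) {
        return Math.abs(betaDegrees) > 90.0;
    }

    /**
     * Returns the four corners {x0,y0, x1,y1, x2,y2, x3,y3} of an image:
     * the reference point, the end of the first pixel row edge, the opposite
     * corner, and the end of the first pixel column edge.
     */
    public static double[] corners(double x, double y, double width, double height,
                                   double alphaDegrees, double betaDegrees) {
        // Pixel rows run in direction alpha.
        double rowAngle = Math.toRadians(alphaDegrees);
        // Pixel columns run in direction beta, relative to the perpendicular of alpha.
        double colAngle = Math.toRadians(alphaDegrees + 90.0 + betaDegrees);
        double rx = width * Math.cos(rowAngle);
        double ry = width * Math.sin(rowAngle);
        double cx = height * Math.cos(colAngle);
        double cy = height * Math.sin(colAngle);
        return new double[] {x, y, x + rx, y + ry, x + rx + cx, y + ry + cy, x + cx, y + cy};
    }

    public static void main(String[] args) {
        // Upright image: alpha = 0 and beta = 0 give an axis-parallel rectangle.
        double[] q = corners(10, 20, 100, 50, 0, 0);
        System.out.printf("%.1f %.1f%n", q[4], q[5]); // opposite corner
    }
}
```

For a mirrored image (abs(beta) > 90) the same arithmetic applies, but the column edge then points below the row edge, so labels such as "upper" and "lower" no longer match the corner order.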
public int imageid
Index of the image in the pCOS pseudo object images[]. Detailed image properties can be retrieved via the entries in this pseudo object.
public int fontid
Index of the font in the fonts[] pseudo object. fontid is never negative.
public double fontsize
The relation of this value to the actual height of glyphs is not fixed, but may vary with the font design. For most fonts the font size is chosen such that it encompasses all ascenders (including accented characters) and descenders.
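public int textrendering
Text rendering mode.
public int colorid
Unique text color id.
public int colorspaceid
Colorspace id or -1.
public int patternid
Pattern id or -1.
public double[] components
Color components.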
Constructor Detail |
---|
public TET() throws TETException
Throws: TETException - May throw an exception in case of memory shortage.
Method Detail |
---|
public final void close_document(int doc) throws TETException
Throws: TETException - TET output cannot be finished after an exception.
public final void close_page(int page) throws TETException
Throws: TETException - TET output cannot be finished after an exception.
public final java.lang.String convert_to_unicode(java.lang.String inputformat, byte[] inputstring, java.lang.String optlist) throws TETException
Specified by: convert_to_unicode in interface IpCOS
Throws: TETException - TET output cannot be finished after an exception.
public final void create_pvf(java.lang.String filename, byte[] data, java.lang.String optlist) throws TETException
Specified by: create_pvf in interface IpCOS
Throws: TETException - TET output cannot be finished after an exception.
public final int delete_pvf(java.lang.String filename) throws TETException
Specified by: delete_pvf in interface IpCOS
Throws: TETException - TET output cannot be finished after an exception.
public final java.lang.String get_apiname()
Specified by: get_apiname in interface IpCOS
public final java.lang.String get_errmsg()
Specified by: get_errmsg in interface IpCOS
public final int get_errnum()
Specified by: get_errnum in interface IpCOS
public final byte[] get_image_data(int doc, int imageid, java.lang.String optlist) throws TETException
Throws: TETException - TET output cannot be finished after an exception.
public final java.lang.String get_text(int page) throws TETException
Throws: TETException - TET output cannot be finished after an exception.
public final double info_pvf(java.lang.String filename, java.lang.String keyword) throws TETException
Specified by: info_pvf in interface IpCOS
Throws: TETException - TET output cannot be finished after an exception.
public final int open_document(java.lang.String filename, java.lang.String optlist) throws TETException
Throws: TETException - TET output cannot be finished after an exception.
public final int open_page(int doc, int pagenumber, java.lang.String optlist) throws TETException
Throws: TETException - TET output cannot be finished after an exception.
public final double pcos_get_number(int doc, java.lang.String path) throws TETException
Specified by: pcos_get_number in interface IpCOS
Throws: TETException - TET output cannot be finished after an exception.
public final java.lang.String pcos_get_string(int doc, java.lang.String path) throws TETException
Specified by: pcos_get_string in interface IpCOS
Throws: TETException - TET output cannot be finished after an exception.
public final byte[] pcos_get_stream(int doc, java.lang.String optlist, java.lang.String path) throws TETException
Specified by: pcos_get_stream in interface IpCOS
Throws: TETException - TET output cannot be finished after an exception.
public final void set_option(java.lang.String optlist) throws TETException
Specified by: set_option in interface IpCOS
Throws: TETException - TET output cannot be finished after an exception.
public final int write_image_file(int doc, int imageid, java.lang.String optlist) throws TETException
Throws: TETException - TET output cannot be finished after an exception.
public final int process_page(int doc, int pageno, java.lang.String optlist) throws TETException
Throws: TETException - TET output cannot be finished after an exception.
public final byte[] get_xml_data(int doc, java.lang.String optlist) throws TETException
Throws: TETException - TET output cannot be finished after an exception.
public final byte[] get_tetml(int doc, java.lang.String optlist) throws TETException
Throws: TETException - TET output cannot be finished after an exception.
public final void delete()
Specified by: delete in interface IpCOS
public final int get_char_info(int page) throws TETException
Throws: TETException - May throw an exception for various reasons.
public final int get_color_info(int doc, int colorid, java.lang.String keyword) throws TETException
Throws: TETException - May throw an exception for various reasons.
public final int get_image_info(int page) throws TETException
Throws: TETException - May throw an exception for various reasons.
public int pcos_open_document(java.lang.String filename, java.lang.String optlist) throws java.lang.Exception
Specified by: pcos_open_document in interface IpCOS
Throws: java.lang.Exception
public void pcos_close_document(int doc, java.lang.String optlist) throws java.lang.Exception
Specified by: pcos_close_document in interface IpCOS
Throws: java.lang.Exception