java.lang.Object
  com.pdflib.TET
public final class TET
Text and Image Extraction Toolkit (TET): Toolkit for extracting Text, Images, and Metadata from PDF Documents.
Note that this is only a syntax summary. For complete information please refer to the TET API reference manual which is available in the "doc" directory of the TET distribution.
Field Summary | |
---|---|
double |
alpha
Direction of inline text progression or direction of the pixel rows. |
static int |
ATTR_ANNOTATION
Property reported in attributes by get_image_info(int) : image extracted from an annotation (appearance stream). |
static int |
ATTR_ARTIFACT
Property reported in attributes by get_char_info(int) and get_image_info(int) : text or image marked as Artifact (irrelevant content) in Tagged PDF. |
static int |
ATTR_DEHYPHENATION_ARTIFACT
Property reported in attributes by get_char_info(int) : hyphenation character, i.e. soft hyphen (unrelated to Tagged PDF Artifact). |
static int |
ATTR_DEHYPHENATION_POST
Property reported in attributes by get_char_info(int) : character after hyphenation. |
static int |
ATTR_DEHYPHENATION_PRE
Property reported in attributes by get_char_info(int) : character before hyphenation. |
static int |
ATTR_DROPCAP
Property reported in attributes by get_char_info(int) : initial large letter. |
static int |
ATTR_NONE
Property reported in attributes by get_char_info(int) : no attribute set. |
static int |
ATTR_PATTERN
Property reported in attributes by get_image_info(int) : image extracted from a pattern. |
static int |
ATTR_SHADOW
Property reported in attributes by get_char_info(int) : shadowed text. |
static int |
ATTR_SOFTMASK
Property reported in attributes by get_image_info(int) : image extracted from a soft mask in a graphics state (defined in a Transparency Group XObject). |
static int |
ATTR_SUB
Property reported in attributes by get_char_info(int) : subscript. |
static int |
ATTR_SUP
Property reported in attributes by get_char_info(int) : superscript. |
int |
attributes
Glyph attributes; see ATTR_NONE etc. |
double |
beta
Text slanting angle or direction of pixel columns relative to the perpendicular of alpha. |
int |
colorid
Color id of the fill and stroke color. |
int |
colorspaceid
Colorspace id or -1. |
double[] |
components
Color components. |
static int |
CT_INSERTED
Type reported in type by get_char_info(int) : inserted word, line, or paragraph separator. |
static int |
CT_NORMAL
Type reported in type by get_char_info(int) : normal character represented by exactly one glyph. |
static int |
CT_SEQ_CONT
Type reported in type by get_char_info(int) : continuation of a sequence. |
static int |
CT_SEQ_START
Type reported in type by get_char_info(int) : start of a sequence, e.g. ligature. |
int |
fontid
Index of the font in the fonts[] pseudo object. |
double |
fontsize
Size of the font (always positive). |
double |
height
Height of glyph or image. |
static int |
IF_J2K
Image format returned by write_image_file(int, int, java.lang.String) : raw JPEG 2000 code stream, *.j2k. |
static int |
IF_JBIG2
Image format returned by write_image_file(int, int, java.lang.String) : image/x-jbig2, *.jbig2. |
static int |
IF_JP2
Image format returned by write_image_file(int, int, java.lang.String) : image/jp2, *.jp2. |
static int |
IF_JPEG
Image format returned by write_image_file(int, int, java.lang.String) : format image/jpeg, *.jpg. |
static int |
IF_JPF
Image format returned by write_image_file(int, int, java.lang.String) : image/jpx, *.jpf. |
static int |
IF_TIFF
Image format returned by write_image_file(int, int, java.lang.String) : image/tiff, *.tif. |
int |
imageid
Index of the image in the pCOS pseudo object images[]. |
int |
patternid
Pattern id or -1. |
int |
textrendering
Text rendering mode; see TR_FILL etc. |
static int |
TR_CLIP
Text rendering mode reported in textrendering by get_char_info(int) : add text to the clipping path. |
static int |
TR_FILL
Text rendering mode reported in textrendering by get_char_info(int) : fill text. |
static int |
TR_FILL_CLIP
Text rendering mode reported in textrendering by get_char_info(int) : fill text and add it to the clipping path. |
static int |
TR_FILLSTROKE
Text rendering mode reported in textrendering by get_char_info(int) : fill and stroke text. |
static int |
TR_FILLSTROKE_CLIP
Text rendering mode reported in textrendering by get_char_info(int) : fill and stroke text and add it to the clipping path. |
static int |
TR_INVISIBLE
Text rendering mode reported in textrendering by get_char_info(int) : invisible text. |
static int |
TR_STROKE
Text rendering mode reported in textrendering by get_char_info(int) : stroke text (outline). |
static int |
TR_STROKE_CLIP
Text rendering mode reported in textrendering by get_char_info(int) : stroke text and add it to the clipping path. |
int |
type
Character type; see CT_NORMAL etc. |
boolean |
unknown
Indicates whether the glyph could be mapped to Unicode. |
int |
uv
UTF-32 Unicode value of the current character. |
double |
width
Width of glyph or image. |
double |
x
x position of the glyph's or image's reference point. |
double |
y
y position of the glyph's or image's reference point. |
Constructor Summary | |
---|---|
TET()
Create a new TET object. |
Method Summary | |
---|---|
void |
close_document(int doc)
Release a document handle and all internal resources related to that document. |
void |
close_page(int page)
Release a page handle and all related resources. |
java.lang.String |
convert_to_unicode(java.lang.String inputformat,
byte[] inputstring,
java.lang.String optlist)
Convert a string in an arbitrary encoding to a Unicode string in various formats. |
void |
create_pvf(java.lang.String filename,
byte[] data,
java.lang.String optlist)
Create a named virtual read-only file from data provided in memory. |
int |
delete_pvf(java.lang.String filename)
Delete a named virtual file and free its data structures. |
void |
delete()
Delete a TET context and release all its internal resources. |
java.lang.String |
get_apiname()
Get the name of the API function which caused an exception or failed. |
int |
get_char_info(int page)
Get detailed information for the next character in the most recent text fragment; the results are reported in public fields. |
int |
get_color_info(int doc,
int colorid,
java.lang.String keyword)
Get detailed information for a color id which has been retrieved with get_char_info(int); the results are reported in public fields. |
java.lang.String |
get_errmsg()
Get the text of the last thrown exception or the reason for a failed function call. |
int |
get_errnum()
Get the number of the last thrown exception or the reason for a failed function call. |
byte[] |
get_image_data(int doc,
int imageid,
java.lang.String optlist)
Write image data to memory. |
int |
get_image_info(int page)
Retrieve information about the next image on the page (but not the actual pixel data); the results are reported in public fields. |
byte[] |
get_tetml(int doc,
java.lang.String optlist)
Retrieve TETML data from memory. |
java.lang.String |
get_text(int page)
Get the next text fragment from a page's content. |
double |
info_pvf(java.lang.String filename,
java.lang.String keyword)
Query properties of a virtual file or the PDFlib Virtual Filesystem (PVF). |
int |
open_document(java.lang.String filename,
java.lang.String optlist)
Open a disk-based or virtual PDF document for content extraction. |
int |
open_page(int doc,
int pagenumber,
java.lang.String optlist)
Open a page for text extraction. |
void |
pcos_close_document(int doc,
java.lang.String optlist)
Close a PDF document via the IpCOS interface. |
double |
pcos_get_number(int doc,
java.lang.String path)
Get the value of a pCOS path with type number or boolean. |
byte[] |
pcos_get_stream(int doc,
java.lang.String optlist,
java.lang.String path)
Get the contents of a pCOS path with type stream, fstream, or string. |
java.lang.String |
pcos_get_string(int doc,
java.lang.String path)
Get the value of a pCOS path with type name, number, string, or boolean. |
int |
pcos_open_document(java.lang.String filename,
java.lang.String optlist)
Open a disk-based or virtual PDF document via the IpCOS interface. |
int |
process_page(int doc,
int pageno,
java.lang.String optlist)
Process a page and create TETML output. |
void |
set_option(java.lang.String optlist)
Set one or more global options for TET. |
int |
write_image_file(int doc,
int imageid,
java.lang.String optlist)
Write image data to disk. |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
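The methods summarized above are typically combined into an open, extract, close sequence. The following is a minimal sketch of that call order, written against a small stand-in interface rather than the real com.pdflib.TET class so that it compiles without the TET library. The names TextSource, FakeSource, and ExtractLoopSketch are hypothetical, and the conventions that open_document and open_page return -1 on failure and that get_text returns null when no further fragment is available are assumptions; the TET API reference manual is the authority on the actual return values and exceptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for the subset of TET methods used in this sketch. */
interface TextSource {
    int open_document(String filename, String optlist);
    int open_page(int doc, int pagenumber, String optlist);
    String get_text(int page);
    void close_page(int page);
    void close_document(int doc);
}

/** Fake source that serves canned text fragments for page 1 only. */
final class FakeSource implements TextSource {
    private final Deque<String> fragments = new ArrayDeque<>();

    FakeSource(String... texts) {
        for (String t : texts) {
            fragments.add(t);
        }
    }

    public int open_document(String filename, String optlist) { return 0; }
    public int open_page(int doc, int pagenumber, String optlist) { return pagenumber == 1 ? 0 : -1; }
    public String get_text(int page) { return fragments.poll(); } // null once all fragments are consumed
    public void close_page(int page) { }
    public void close_document(int doc) { }
}

final class ExtractLoopSketch {
    /** Concatenates all text fragments of one page and releases handles in reverse order. */
    static String extractPage(TextSource tet, String filename, int pagenumber) {
        int doc = tet.open_document(filename, "");
        if (doc == -1) { // assumed failure convention
            throw new IllegalStateException("cannot open " + filename);
        }
        try {
            int page = tet.open_page(doc, pagenumber, "");
            if (page == -1) { // assumed failure convention
                throw new IllegalStateException("cannot open page " + pagenumber);
            }
            try {
                StringBuilder sb = new StringBuilder();
                String text;
                while ((text = tet.get_text(page)) != null) {
                    sb.append(text);
                }
                return sb.toString();
            } finally {
                tet.close_page(page);
            }
        } finally {
            tet.close_document(doc);
        }
    }

    public static void main(String[] args) {
        TextSource tet = new FakeSource("Hello ", "world");
        System.out.println(extractPage(tet, "sample.pdf", 1));
    }
}
```

The nested try/finally blocks mirror the pairing of open_page with close_page and open_document with close_document, so handles are released even if extraction fails part-way.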
Field Detail |
---|
public int uv
UTF-32 Unicode value of the current character.
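Because uv holds a UTF-32 value while Java strings are UTF-16, values above U+FFFF need a surrogate pair when converted to a String. The sketch below shows one way to do the conversion with the standard library; the class and method names are hypothetical and not part of TET.

```java
final class CodePointSketch {
    /** Converts a UTF-32 value, such as the one reported in uv, to a Java String. */
    static String fromUtf32(int uv) {
        // Character.toChars returns one char for BMP values and a surrogate pair otherwise.
        return new String(Character.toChars(uv));
    }

    public static void main(String[] args) {
        System.out.println(fromUtf32(0x41));
        System.out.println(fromUtf32(0x1D11E).length());
    }
}
```

Character.toChars throws IllegalArgumentException for values that are not valid Unicode code points, so callers may want to guard the call when the input is untrusted.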
public int type
Character type; see CT_NORMAL etc. for possible values.
public boolean unknown
Indicates whether the glyph could be mapped to Unicode.
public int attributes
Glyph attributes; see ATTR_NONE etc. for possible values.
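If the ATTR_ constants are bit flags that can be combined in attributes (an assumption; this summary does not state it), individual properties can be tested with a bitwise AND. The sketch below uses placeholder constant values chosen only for illustration; real code would use the TET.ATTR_ constants instead. The class and method names are hypothetical.

```java
final class AttributeFlagsSketch {
    // Placeholder values for illustration only; not taken from the TET library.
    static final int ATTR_NONE = 0;
    static final int ATTR_SUB = 1;
    static final int ATTR_SUP = 2;
    static final int ATTR_DROPCAP = 4;

    /** Returns true if the given flag is set in the attributes value. */
    static boolean has(int attributes, int flag) {
        return (attributes & flag) != 0;
    }

    /** Builds a short human-readable list of the flags that are set. */
    static String describe(int attributes) {
        if (attributes == ATTR_NONE) {
            return "none";
        }
        StringBuilder sb = new StringBuilder();
        if (has(attributes, ATTR_SUB)) sb.append("sub ");
        if (has(attributes, ATTR_SUP)) sb.append("sup ");
        if (has(attributes, ATTR_DROPCAP)) sb.append("dropcap ");
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(describe(ATTR_SUP | ATTR_DROPCAP));
    }
}
```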
public double x
x/y describe the position of the glyph's or image's reference point.
Text: The reference point is the lower left corner of the glyph box for horizontal writing mode, and the top center point for vertical writing mode. For artificial characters the x, y coordinates will be those of the end point of the most recent real character.
Images: The reference point is the lower left corner of the image.
public double y
See x.
public double width
Text: Width of the corresponding glyph (for both horizontal and vertical writing mode). For artificial characters the width will be 0.
Images: Width of the image on the page in points, measured along the image's edges.
public double height
Height of glyph or image; see width.
public double alpha
Text: Direction of inline text progression in degrees measured counter-clockwise. For horizontal writing mode this is the direction of the text baseline; for vertical writing mode it is the deviation from the standard -90° direction. The angle will be in the range -180° < alpha <= +180°. For standard horizontal text as well as for standard text in vertical writing mode the angle will be 0°.
Images: Direction of the pixel rows. The angle will be in the range -180° < alpha <= +180°. For upright images alpha will be 0°.
public double beta
Text: Text slanting angle in degrees (counter-clockwise), relative to the perpendicular of alpha. The angle will be 0° for upright text, and negative for italicized (slanted) text. The angle will be in the range -180° < beta <= 180°, but different from ±90°. If abs(beta) > 90° the text is mirrored at the baseline.
Images: Direction of the pixel columns, relative to the perpendicular of alpha. The angle will be in the range -180° < beta <= +180°, but different from ±90°. For upright images beta will be in the range -90° < beta < +90°. If abs(beta) > 90° the image will be mirrored at the baseline.
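The ranges stated for alpha and beta can be illustrated with two small helpers: one that brings an arbitrary angle into the half-open range -180° < a <= +180°, and one that applies the abs(beta) > 90° mirroring rule. This is a sketch for working with the reported values, not part of the TET API; the class and method names are hypothetical.

```java
final class AngleSketch {
    /** Normalizes an angle in degrees to the range -180 < a <= 180. */
    static double normalize(double degrees) {
        double a = degrees % 360.0; // result lies strictly between -360 and 360
        if (a > 180.0) {
            a -= 360.0;
        } else if (a <= -180.0) {
            a += 360.0;
        }
        return a;
    }

    /** Applies the rule that content is mirrored at the baseline if abs(beta) > 90. */
    static boolean isMirrored(double beta) {
        return Math.abs(beta) > 90.0;
    }

    public static void main(String[] args) {
        System.out.println(normalize(270.0));
        System.out.println(isMirrored(-12.0));
    }
}
```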
public int imageid
Index of the image in the pCOS pseudo object images[]. Detailed image properties can be retrieved via the entries in this pseudo object.
public int fontid
Index of the font in the fonts[] pseudo object. fontid is never negative.
public double fontsize
Size of the font (always positive). The relation of this value to the actual height of glyphs is not fixed, but may vary with the font design. For most fonts the font size is chosen such that it encompasses all ascenders (including accented characters) and descenders.
public int textrendering
Text rendering mode; see TR_FILL etc. for possible values.
public int colorid
Color id of the fill and stroke color.
public int colorspaceid
Colorspace id or -1.
public int patternid
Pattern id or -1.
public double[] components
Color components.
public static final int CT_NORMAL
Type reported in type by get_char_info(int): normal character represented by exactly one glyph.
public static final int CT_SEQ_START
Type reported in type by get_char_info(int): start of a sequence, e.g. ligature.
public static final int CT_SEQ_CONT
Type reported in type by get_char_info(int): continuation of a sequence.
public static final int CT_INSERTED
Type reported in type by get_char_info(int): inserted word, line, or paragraph separator.
public static final int ATTR_NONE
Property reported in attributes by get_char_info(int): no attribute set.
public static final int ATTR_SUB
Property reported in attributes by get_char_info(int): subscript.
public static final int ATTR_SUP
Property reported in attributes by get_char_info(int): superscript.
public static final int ATTR_DROPCAP
Property reported in attributes by get_char_info(int): initial large letter.
public static final int ATTR_SHADOW
Property reported in attributes by get_char_info(int): shadowed text.
public static final int ATTR_DEHYPHENATION_PRE
Property reported in attributes by get_char_info(int): character before hyphenation.
public static final int ATTR_DEHYPHENATION_ARTIFACT
Property reported in attributes by get_char_info(int): hyphenation character, i.e. soft hyphen (unrelated to Tagged PDF Artifact).
public static final int ATTR_DEHYPHENATION_POST
Property reported in attributes by get_char_info(int): character after hyphenation.
public static final int ATTR_ARTIFACT
Property reported in attributes by get_char_info(int) and get_image_info(int): text or image marked as Artifact (irrelevant content) in Tagged PDF.
public static final int ATTR_ANNOTATION
Property reported in attributes by get_image_info(int): image extracted from an annotation (appearance stream).
public static final int ATTR_PATTERN
Property reported in attributes by get_image_info(int): image extracted from a pattern.
public static final int ATTR_SOFTMASK
Property reported in attributes by get_image_info(int): image extracted from a soft mask in a graphics state (defined in a Transparency Group XObject).
public static final int TR_FILL
Text rendering mode reported in textrendering by get_char_info(int): fill text.
public static final int TR_STROKE
Text rendering mode reported in textrendering by get_char_info(int): stroke text (outline).
public static final int TR_FILLSTROKE
Text rendering mode reported in textrendering by get_char_info(int): fill and stroke text.
public static final int TR_INVISIBLE
Text rendering mode reported in textrendering by get_char_info(int): invisible text.
public static final int TR_FILL_CLIP
Text rendering mode reported in textrendering by get_char_info(int): fill text and add it to the clipping path.
public static final int TR_STROKE_CLIP
Text rendering mode reported in textrendering by get_char_info(int): stroke text and add it to the clipping path.
public static final int TR_FILLSTROKE_CLIP
Text rendering mode reported in textrendering by get_char_info(int): fill and stroke text and add it to the clipping path.
public static final int TR_CLIP
Text rendering mode reported in textrendering by get_char_info(int): add text to the clipping path.
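The eight TR_ constants describe the same eight behaviors as the text rendering modes 0 to 7 of the PDF Tr operator. The sketch below models those modes as an enum in PDF order and derives whether a mode paints any visible marks, which can be useful for skipping invisible text. Whether the TET constants carry the same numeric values as the PDF modes is an assumption, and the names RenderMode and RenderModeSketch are hypothetical.

```java
/** PDF text rendering modes in the order 0 to 7; not taken from the TET library. */
enum RenderMode {
    FILL(true, false, false),
    STROKE(false, true, false),
    FILLSTROKE(true, true, false),
    INVISIBLE(false, false, false),
    FILL_CLIP(true, false, true),
    STROKE_CLIP(false, true, true),
    FILLSTROKE_CLIP(true, true, true),
    CLIP(false, false, true);

    final boolean fills;
    final boolean strokes;
    final boolean clips;

    RenderMode(boolean fills, boolean strokes, boolean clips) {
        this.fills = fills;
        this.strokes = strokes;
        this.clips = clips;
    }

    /** A mode paints visible marks only if it fills or strokes the glyphs. */
    boolean isVisible() {
        return fills || strokes;
    }

    /** Maps a PDF mode number (0 to 7) to the enum constant. */
    static RenderMode fromPdfMode(int mode) {
        RenderMode[] all = values();
        if (mode < 0 || mode >= all.length) {
            throw new IllegalArgumentException("unknown text rendering mode: " + mode);
        }
        return all[mode];
    }
}

final class RenderModeSketch {
    public static void main(String[] args) {
        RenderMode m = RenderMode.fromPdfMode(3);
        System.out.println(m + " visible=" + m.isVisible());
    }
}
```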
public static final int IF_TIFF
Image format returned by write_image_file(int, int, java.lang.String): image/tiff, *.tif.
public static final int IF_JPEG
Image format returned by write_image_file(int, int, java.lang.String): image/jpeg, *.jpg.
public static final int IF_JP2
Image format returned by write_image_file(int, int, java.lang.String): image/jp2, *.jp2.
public static final int IF_JPF
Image format returned by write_image_file(int, int, java.lang.String): image/jpx, *.jpf.
public static final int IF_J2K
Image format returned by write_image_file(int, int, java.lang.String): raw JPEG 2000 code stream, *.j2k.
public static final int IF_JBIG2
Image format returned by write_image_file(int, int, java.lang.String): image/x-jbig2, *.jbig2.
Constructor Detail |
---|
public TET() throws TETException
TETException - May throw an exception in case of memory shortage.
Method Detail |
---|
public final void close_document(int doc) throws TETException
doc - doc
TETException - TET output cannot be finished after an exception.
public final void close_page(int page) throws TETException
page - page
TETException - TET output cannot be finished after an exception.
public final java.lang.String convert_to_unicode(java.lang.String inputformat, byte[] inputstring, java.lang.String optlist) throws TETException
convert_to_unicode in interface IpCOS
inputformat - inputformat
inputstring - inputstring
optlist - optlist
TETException - TET output cannot be finished after an exception.
public final void create_pvf(java.lang.String filename, byte[] data, java.lang.String optlist) throws TETException
create_pvf in interface IpCOS
filename - filename
data - data
optlist - optlist
TETException - TET output cannot be finished after an exception.
public final int delete_pvf(java.lang.String filename) throws TETException
delete_pvf in interface IpCOS
filename - filename
TETException - TET output cannot be finished after an exception.
public final java.lang.String get_apiname()
get_apiname in interface IpCOS
public final java.lang.String get_errmsg()
get_errmsg in interface IpCOS
public final int get_errnum()
get_errnum in interface IpCOS
public final byte[] get_image_data(int doc, int imageid, java.lang.String optlist) throws TETException
doc - doc
imageid - imageid
optlist - optlist
TETException - TET output cannot be finished after an exception.
public final java.lang.String get_text(int page) throws TETException
page - page
TETException - TET output cannot be finished after an exception.
public final double info_pvf(java.lang.String filename, java.lang.String keyword) throws TETException
info_pvf in interface IpCOS
filename - filename
keyword - keyword
TETException - TET output cannot be finished after an exception.
public final int open_document(java.lang.String filename, java.lang.String optlist) throws TETException
filename - filename
optlist - optlist
TETException - TET output cannot be finished after an exception.
public final int open_page(int doc, int pagenumber, java.lang.String optlist) throws TETException
doc - doc
pagenumber - pagenumber
optlist - optlist
TETException - TET output cannot be finished after an exception.
public final double pcos_get_number(int doc, java.lang.String path) throws TETException
pcos_get_number in interface IpCOS
doc - doc
path - path
TETException - TET output cannot be finished after an exception.
public final java.lang.String pcos_get_string(int doc, java.lang.String path) throws TETException
pcos_get_string in interface IpCOS
doc - doc
path - path
TETException - TET output cannot be finished after an exception.
public final byte[] pcos_get_stream(int doc, java.lang.String optlist, java.lang.String path) throws TETException
pcos_get_stream in interface IpCOS
doc - doc
optlist - optlist
path - path
TETException - TET output cannot be finished after an exception.
public final void set_option(java.lang.String optlist) throws TETException
set_option in interface IpCOS
optlist - optlist
TETException - TET output cannot be finished after an exception.
public final int write_image_file(int doc, int imageid, java.lang.String optlist) throws TETException
doc - doc
imageid - imageid
optlist - optlist
TETException - TET output cannot be finished after an exception.
public final int process_page(int doc, int pageno, java.lang.String optlist) throws TETException
doc - doc
pageno - pageno
optlist - optlist
TETException - TET output cannot be finished after an exception.
public final byte[] get_tetml(int doc, java.lang.String optlist) throws TETException
doc - doc
optlist - optlist
TETException - TET output cannot be finished after an exception.
public final void delete()
delete in interface IpCOS
public final int get_char_info(int page) throws TETException
page - page
TETException - May throw an exception for various reasons.
public final int get_color_info(int doc, int colorid, java.lang.String keyword) throws TETException
doc - doc
colorid - colorid
keyword - keyword
TETException - May throw an exception for various reasons.
public final int get_image_info(int page) throws TETException
page - page
TETException - May throw an exception for various reasons.
public int pcos_open_document(java.lang.String filename, java.lang.String optlist) throws java.lang.Exception
pcos_open_document in interface IpCOS
filename - The full path name of the PDF file to be opened. The file will be searched by means of the SearchPath resource.
optlist - An option list specifying document options.
java.lang.Exception - see manual
public void pcos_close_document(int doc, java.lang.String optlist) throws java.lang.Exception
pcos_close_document in interface IpCOS
doc - A valid document handle obtained with open_document().
optlist - An option list specifying document options.
java.lang.Exception - see manual