Package com.pdflib

Class TET

  • All Implemented Interfaces:
    IpCOS

    public final class TET
    extends java.lang.Object
    implements IpCOS
    Text and Image Extraction Toolkit (TET): Toolkit for extracting Text, Images, and Metadata from PDF Documents.

    Note that this is only a syntax summary. For complete information please refer to the TET API reference manual which is available in the "doc" directory of the TET distribution.

    Version:
    5.3
    Author:
    Rainer Schaaf
    • Field Detail

      • uv

        public int uv
        UTF-32 Unicode value of the current character.
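
        Because uv holds a UTF-32 value, characters above U+FFFF do not fit into a single Java char. The following is a minimal sketch of turning such a value into a String using only java.lang; the names CodePoints and toJavaString are illustrative and not part of TET.

```java
// Illustrative helper, not part of the TET API.
final class CodePoints {
    private CodePoints() {}

    /** Converts a UTF-32 value (such as the uv field) to a Java String. */
    static String toJavaString(int uv) {
        // Character.toChars yields one char for BMP values and a
        // surrogate pair for supplementary code points.
        return new String(Character.toChars(uv));
    }
}
```

        With a TET object named tet, the call would be CodePoints.toJavaString(tet.uv).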
      • type

        public int type
        Character type; see CT_NORMAL etc. for possible values.
      • unknown

        public boolean unknown
        Indicates whether the glyph could be mapped to Unicode.
      • attributes

        public int attributes
        Glyph attributes; see ATTR_NONE etc. for possible values.
      • x

        public double x
        x position of the glyph's or image's reference point.

        x/y describe the position of the glyph's or image's reference point.

        Text: The reference point is the lower left corner of the glyph box for horizontal writing mode, and the top center point for vertical writing mode. For artificial characters the x, y coordinates will be those of the end point of the most recent real character.

        Images: The reference point is the lower left corner of the image.

      • y

        public double y
        y position of the glyph's or image's reference point.
        See Also:
        x
      • width

        public double width
        Width of glyph or image.

        Text: Width of the corresponding glyph (for both horizontal and vertical writing mode). For artificial characters the width will be 0.

        Images: Width of the image on the page in points, measured along the image's edges.

      • height

        public double height
        Height of glyph or image.
        See Also:
        width
      • alpha

        public double alpha
        Direction of inline text progression or direction of the pixel rows.

        Text: Direction of inline text progression in degrees measured counter-clockwise. For horizontal writing mode this is the direction of the text baseline; for vertical writing mode it is the digression from the standard -90° direction. The angle will be in the range -180° < alpha <= +180°. For standard horizontal text as well as for standard text in vertical writing mode the angle will be 0°.

        Images: Direction of the pixel rows. The angle will be in the range -180° < alpha <= +180°. For upright images alpha will be 0°.
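
        For horizontal writing mode, x, y, width, and alpha can be combined to estimate where a glyph ends on its baseline. The sketch below assumes a coordinate system in which y grows upward and that width is measured along the direction given by alpha; neither assumption is stated in this summary. GlyphGeometry is an illustrative name and not part of TET.

```java
// Illustrative helper, not part of the TET API.
final class GlyphGeometry {
    private GlyphGeometry() {}

    /**
     * Returns {endX, endY}: the reference point (x, y) moved by width
     * in the direction alphaDegrees (counter-clockwise, in degrees).
     */
    static double[] endPoint(double x, double y, double width, double alphaDegrees) {
        double a = Math.toRadians(alphaDegrees);
        return new double[] { x + width * Math.cos(a), y + width * Math.sin(a) };
    }
}
```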

      • beta

        public double beta
        Text slanting angle or direction of pixel columns relative to the perpendicular of alpha.

        Text: Text slanting angle in degrees (counter-clockwise), relative to the perpendicular of alpha. The angle will be 0° for upright text, and negative for italicized (slanted) text. The angle will be in the range -180° < beta <= 180°, but different from ±90°. If abs(beta) > 90° the text is mirrored at the baseline.

        Images: Direction of the pixel columns, relative to the perpendicular of alpha. The angle will be in the range -180° < beta <= +180°, but different from ±90°. For upright images beta will be in the range -90° < beta < +90°. If abs(beta) > 90° the image will be mirrored at the baseline.
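
        The beta rules above reduce to two simple checks. This is a minimal sketch with illustrative names (Slant is not part of TET); restricting isSlanted to the unmirrored range is an interpretation of the text, not a documented rule.

```java
// Illustrative helper, not part of the TET API.
final class Slant {
    private Slant() {}

    /** True if text or image is mirrored at the baseline: abs(beta) > 90. */
    static boolean isMirrored(double beta) {
        return Math.abs(beta) > 90.0;
    }

    /** True for unmirrored, italicized (slanted) text: -90 < beta < 0. */
    static boolean isSlanted(double beta) {
        return beta < 0.0 && beta > -90.0;
    }
}
```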

      • imageid

        public int imageid
        Index of the image in the pCOS pseudo object images[].

        Detailed image properties can be retrieved via the entries in this pseudo object.

      • fontid

        public int fontid
        Index of the font in the fonts[] pseudo object.

        fontid is never negative.

      • fontsize

        public double fontsize
        Size of the font (always positive).

        The relation of this value to the actual height of glyphs is not fixed, but may vary with the font design. For most fonts the font size is chosen such that it encompasses all ascenders (including accented characters) and descenders.

      • textrendering

        public int textrendering
        Text rendering mode; see TR_FILL etc. for possible values.
      • colorid

        public int colorid
        Color id of the fill and stroke color.
      • colorspaceid

        public int colorspaceid
        Colorspace id or -1.
      • patternid

        public int patternid
        Pattern id or -1.
      • components

        public double[] components
        Color components.
      • ATTR_DEHYPHENATION_ARTIFACT

        public static final int ATTR_DEHYPHENATION_ARTIFACT
        Property reported in attributes by get_char_info(int): hyphenation character, i.e. soft hyphen (unrelated to Tagged PDF Artifact).
        See Also:
        Constant Field Values
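
        Assuming that attributes is a bit mask in which each ATTR_ constant occupies its own bit (this summary does not say so explicitly), a property can be tested with a bitwise AND. GlyphAttributes is an illustrative name and not part of TET; the flag is passed in as an argument so that no constant value has to be guessed here.

```java
// Illustrative helper, not part of the TET API.
final class GlyphAttributes {
    private GlyphAttributes() {}

    /**
     * Returns true if all bits of flag are set in attributes.
     * Assumption: attributes is a bit mask of ATTR_ constants.
     */
    static boolean hasAttribute(int attributes, int flag) {
        return flag != 0 && (attributes & flag) == flag;
    }
}
```

        With a TET object named tet, a call might look like GlyphAttributes.hasAttribute(tet.attributes, TET.ATTR_DEHYPHENATION_ARTIFACT).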
    • Constructor Detail

      • TET

        public TET()
            throws TETException
        Create a new TET object.
        Throws:
        TETException - May throw an exception in case of memory shortage.
    • Method Detail

      • close_document

        public final void close_document​(int doc)
                                  throws TETException
        Release a document handle and all internal resources related to that document.
        Parameters:
        doc - A valid document handle obtained with open_document().
        Throws:
        TETException - TET output cannot be finished after an exception.
      • close_page

        public final void close_page​(int page)
                              throws TETException
        Release a page handle and all related resources.
        Parameters:
        page - A valid page handle obtained with open_page().
        Throws:
        TETException - TET output cannot be finished after an exception.
      • convert_to_unicode

        public final java.lang.String convert_to_unicode​(java.lang.String inputformat,
                                                         byte[] inputstring,
                                                         java.lang.String optlist)
                                                  throws TETException
        Convert a string in an arbitrary encoding to a Unicode string in various formats.
        Specified by:
        convert_to_unicode in interface IpCOS
        Parameters:
        inputformat - inputformat
        inputstring - inputstring
        optlist - optlist
        Returns:
        The converted Unicode string.
        Throws:
        TETException - TET output cannot be finished after an exception.
      • create_pvf

        public final void create_pvf​(java.lang.String filename,
                                     byte[] data,
                                     java.lang.String optlist)
                              throws TETException
        Create a named virtual read-only file from data provided in memory.
        Specified by:
        create_pvf in interface IpCOS
        Parameters:
        filename - The name of the virtual file.
        data - The data that makes up the contents of the virtual file.
        optlist - optlist
        Throws:
        TETException - TET output cannot be finished after an exception.
      • delete_pvf

        public final int delete_pvf​(java.lang.String filename)
                             throws TETException
        Delete a named virtual file and free its data structures.
        Specified by:
        delete_pvf in interface IpCOS
        Parameters:
        filename - The name of the virtual file to delete.
        Returns:
        -1 if the virtual file exists but is locked, and 1 otherwise.
        Throws:
        TETException - TET output cannot be finished after an exception.
      • get_apiname

        public final java.lang.String get_apiname()
        Get the name of the API function which caused an exception or failed.
        Specified by:
        get_apiname in interface IpCOS
        Returns:
        Name of an API function.
      • get_errmsg

        public final java.lang.String get_errmsg()
        Get the text of the last thrown exception or the reason for a failed function call.
        Specified by:
        get_errmsg in interface IpCOS
        Returns:
        Text containing the description of the most recent error condition.
      • get_errnum

        public final int get_errnum()
        Get the number of the last thrown exception or the reason for a failed function call.
        Specified by:
        get_errnum in interface IpCOS
        Returns:
        Error number of the most recent error condition.
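
        The three error accessors are typically combined into a single diagnostic line. The sketch below takes their results as plain arguments so that it compiles without the TET jar; ErrorText is an illustrative name, and the message layout is only one possible choice.

```java
// Illustrative helper, not part of the TET API.
final class ErrorText {
    private ErrorText() {}

    /** Builds one diagnostic line from error number, API name, and message. */
    static String describe(int errnum, String apiname, String errmsg) {
        return "Error " + errnum + " in " + apiname + "(): " + errmsg;
    }
}
```

        With a TET object named tet, this would be called as ErrorText.describe(tet.get_errnum(), tet.get_apiname(), tet.get_errmsg()).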
      • get_image_data

        public final byte[] get_image_data​(int doc,
                                           int imageid,
                                           java.lang.String optlist)
                                    throws TETException
        Write image data to memory.
        Parameters:
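        doc - A valid document handle obtained with open_document().
        imageid - Index of the image in the pCOS pseudo object images[].
        optlist - An option list controlling the returned image data.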
        Returns:
        Data representing the image according to the specified options.
        Throws:
        TETException - TET output cannot be finished after an exception.
      • get_text

        public final java.lang.String get_text​(int page)
                                        throws TETException
        Get the next text fragment from a page's content.
        Parameters:
        page - A valid page handle obtained with open_page().
        Returns:
        A string containing the next text fragment on the page.
        Throws:
        TETException - TET output cannot be finished after an exception.
      • info_pvf

        public final double info_pvf​(java.lang.String filename,
                                     java.lang.String keyword)
                              throws TETException
        Query properties of a virtual file or the PDFlib Virtual Filesystem (PVF).
        Specified by:
        info_pvf in interface IpCOS
        Parameters:
        filename - The name of the virtual file to query.
        keyword - The name of the property to query.
        Returns:
        The value of some file parameter as requested by keyword.
        Throws:
        TETException - TET output cannot be finished after an exception.
      • open_document

        public final int open_document​(java.lang.String filename,
                                       java.lang.String optlist)
                                throws TETException
        Open a disk-based or virtual PDF document for content extraction.
        Parameters:
        filename - The name of the disk-based or virtual PDF file to open.
        optlist - An option list specifying document options.
        Returns:
        -1 on error, or a document handle otherwise.
        Throws:
        TETException - TET output cannot be finished after an exception.
      • open_page

        public final int open_page​(int doc,
                                   int pagenumber,
                                   java.lang.String optlist)
                            throws TETException
        Open a page for text extraction.
        Parameters:
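        doc - A valid document handle obtained with open_document().
        pagenumber - The number of the page to open.
        optlist - An option list specifying page options.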
        Returns:
        A handle for the page, or -1 in case of an error.
        Throws:
        TETException - TET output cannot be finished after an exception.
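
        The open and close calls pair up naturally with try/finally. The sketch below is written against a small stand-in interface (TextApi, hypothetical) that mirrors a few of the signatures above without the throws TETException clause, so that it compiles without the TET jar. It assumes that get_text returns null once no fragment is left and that an empty option list is acceptable; neither is stated in this summary.

```java
/** Hypothetical stand-in mirroring a few TET signatures (without throws clauses). */
interface TextApi {
    int open_document(String filename, String optlist);
    int open_page(int doc, int pagenumber, String optlist);
    String get_text(int page);
    void close_page(int page);
    void close_document(int doc);
}

// Illustrative helper, not part of the TET API.
final class PageTextReader {
    private PageTextReader() {}

    /** Collects all text fragments of one page; empty list if opening fails. */
    static java.util.List<String> readPage(TextApi tet, String filename, int pagenumber) {
        java.util.List<String> fragments = new java.util.ArrayList<>();
        int doc = tet.open_document(filename, "");
        if (doc == -1) {
            return fragments;               // -1 is the documented error value
        }
        try {
            int page = tet.open_page(doc, pagenumber, "");
            if (page == -1) {
                return fragments;           // -1 is the documented error value
            }
            try {
                String text;
                // Assumption: null signals that no more text is available.
                while ((text = tet.get_text(page)) != null) {
                    fragments.add(text);
                }
            } finally {
                tet.close_page(page);       // release the page handle
            }
        } finally {
            tet.close_document(doc);        // release the document handle
        }
        return fragments;
    }
}
```

        With the real class the same structure applies to a com.pdflib.TET object, with TETException handled by the caller and get_errmsg() consulted when a handle is -1.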
      • pcos_get_number

        public final double pcos_get_number​(int doc,
                                            java.lang.String path)
                                     throws TETException
        Get the value of a pCOS path with type number or boolean.
        Specified by:
        pcos_get_number in interface IpCOS
        Parameters:
        doc - A valid document handle obtained with open_document().
        path - A pCOS path identifying a number or boolean object.
        Returns:
        The numerical value of the object identified by the pCOS path.
        Throws:
        TETException - TET output cannot be finished after an exception.
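
        Since pcos_get_number reports both numbers and booleans as a double, callers usually convert the result. This is a minimal sketch against a hypothetical one-method stand-in (PcosNumberSource) so that it compiles without the TET jar; it assumes that booleans are reported as 0 for false and a nonzero value for true. The path "length:pages" in the usage note is a commonly used pCOS path for the page count.

```java
/** Hypothetical stand-in for the single method used below (no throws clause). */
interface PcosNumberSource {
    double pcos_get_number(int doc, String path);
}

// Illustrative helper, not part of the TET API.
final class PcosValues {
    private PcosValues() {}

    /** Reads an integral pCOS number such as a count. */
    static int getInt(PcosNumberSource tet, int doc, String path) {
        return (int) Math.round(tet.pcos_get_number(doc, path));
    }

    /** Reads a pCOS boolean; assumes 0 means false and nonzero means true. */
    static boolean getBoolean(PcosNumberSource tet, int doc, String path) {
        return tet.pcos_get_number(doc, path) != 0.0;
    }
}
```

        With the real class the same conversions apply to the result of tet.pcos_get_number(doc, "length:pages") and similar calls.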
      • pcos_get_string

        public final java.lang.String pcos_get_string​(int doc,
                                                      java.lang.String path)
                                               throws TETException
        Get the value of a pCOS path with type name, number, string, or boolean.
        Specified by:
        pcos_get_string in interface IpCOS
        Parameters:
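        doc - A valid document handle obtained with open_document().
        path - A pCOS path identifying a name, number, string, or boolean object.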
        Returns:
        A string with the value of the object identified by the pCOS path.
        Throws:
        TETException - TET output cannot be finished after an exception.
      • pcos_get_stream

        public final byte[] pcos_get_stream​(int doc,
                                            java.lang.String optlist,
                                            java.lang.String path)
                                     throws TETException
        Get the contents of a pCOS path with type stream, fstream, or string.
        Specified by:
        pcos_get_stream in interface IpCOS
        Parameters:
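        doc - A valid document handle obtained with open_document().
        optlist - An option list specifying stream retrieval options.
        path - A pCOS path identifying a stream, fstream, or string object.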
        Returns:
        The unencrypted data contained in the stream or string.
        Throws:
        TETException - TET output cannot be finished after an exception.
      • set_option

        public final void set_option​(java.lang.String optlist)
                              throws TETException
        Set one or more global options for TET.
        Specified by:
        set_option in interface IpCOS
        Parameters:
        optlist - An option list specifying one or more global options.
        Throws:
        TETException - TET output cannot be finished after an exception.
      • write_image_file

        public final int write_image_file​(int doc,
                                          int imageid,
                                          java.lang.String optlist)
                                   throws TETException
        Write image data to disk.
        Parameters:
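        doc - A valid document handle obtained with open_document().
        imageid - Index of the image in the pCOS pseudo object images[].
        optlist - An option list controlling the image output.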
        Returns:
        -1 on error, or the image format otherwise (see IF_TIFF etc.)
        Throws:
        TETException - TET output cannot be finished after an exception.
      • process_page

        public final int process_page​(int doc,
                                      int pageno,
                                      java.lang.String optlist)
                               throws TETException
        Process a page and create TETML output.
        Parameters:
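        doc - A valid document handle obtained with open_document().
        pageno - The number of the page to process.
        optlist - An option list specifying page processing options.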
        Returns:
        Always 1. PDF problems are reported in a TETML Exception element.
        Throws:
        TETException - TET output cannot be finished after an exception.
      • get_tetml

        public final byte[] get_tetml​(int doc,
                                      java.lang.String optlist)
                               throws TETException
        Retrieve TETML data from memory.
        Parameters:
        doc - A valid document handle obtained with open_document().
        optlist - optlist
        Returns:
        A byte array containing the next chunk of TETML data.
        Throws:
        TETException - TET output cannot be finished after an exception.
      • delete

        public final void delete()
        Delete a TET context and release all its internal resources. This should be called for cleanup when processing is done, and after a TETException has occurred. The finalizer may also call this method; issuing multiple calls is safe.
        Specified by:
        delete in interface IpCOS
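
        Because delete() takes no arguments and declares no checked exception, it can be tied to a try-with-resources block through a small adapter. The sketch below uses only java.lang; Cleanup is an illustrative name and not part of TET.

```java
/** Illustrative adapter: runs a cleanup action (such as tet::delete) on close. */
final class Cleanup implements AutoCloseable {
    private final Runnable action;

    Cleanup(Runnable action) {
        this.action = action;
    }

    @Override
    public void close() {
        // Runs on normal exit and when an exception leaves the try block.
        action.run();
    }
}
```

        A possible use with a TET object named tet is try (Cleanup c = new Cleanup(tet::delete)) { ... }, which calls delete() whether or not the body throws.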
      • get_char_info

        public final int get_char_info​(int page)
                                throws TETException
        Get detailed information for the next character in the most recent text fragment; the results are reported in public fields.
        Parameters:
        page - A valid page handle obtained with open_page().
        Returns:
        Binding-specific error or success code.
        Throws:
        TETException - May throw an exception for various reasons.
      • get_color_info

        public final int get_color_info​(int doc,
                                        int colorid,
                                        java.lang.String keyword)
                                 throws TETException
        Get detailed information for a color id which has been retrieved with get_char_info(int); the results are reported in public fields.
        Parameters:
        doc - A valid document handle obtained with open_document().
        colorid - A color id retrieved with get_char_info(int).
        keyword - keyword
        Returns:
        Details about the requested color space and color.
        Throws:
        TETException - May throw an exception for various reasons.
      • get_image_info

        public final int get_image_info​(int page)
                                 throws TETException
        Retrieve information about the next image on the page (but not the actual pixel data); the results are reported in public fields.
        Parameters:
        page - A valid page handle obtained with open_page().
        Returns:
        Details about the next image on the page.
        Throws:
        TETException - May throw an exception for various reasons.
      • pcos_open_document

        public int pcos_open_document​(java.lang.String filename,
                                      java.lang.String optlist)
                               throws java.lang.Exception
        Open a disk-based or virtual PDF document via the IpCOS interface.
        Specified by:
        pcos_open_document in interface IpCOS
        Parameters:
        filename - The full path name of the PDF file to be opened. The file will be searched by means of the SearchPath resource.
        optlist - An option list specifying document options.
        Returns:
        A document handle.
        Throws:
        java.lang.Exception - see manual
      • pcos_close_document

        public void pcos_close_document​(int doc,
                                        java.lang.String optlist)
                                 throws java.lang.Exception
        Close a PDF input document via the IpCOS interface.
        Specified by:
        pcos_close_document in interface IpCOS
        Parameters:
        doc - A valid document handle obtained with pcos_open_document().
        optlist - An option list specifying document options.
        Throws:
        java.lang.Exception - see manual