API compatibility
=================

TET versions and .NET Core versions
-----------------------------------

.NET Core introduced a separate versioning scheme; it does not apply to the
Classic .NET binding:

TET     .NET Core
--------------------
5.2px   1.0.x   (first version for .NET Core)
5.3px   2.0.x
5.4px   3.0.x
5.5px   4.0.x
5.6px   5.0.x
6.0px   6.0.x


The TET Migration Guide contains recommended replacements for deprecated
and removed options.


=======
TET 6.0
=======

- No compatibility-relevant changes.

=======
TET 5.6
=======

- No compatibility-relevant changes.

=======
TET 5.5
=======

- No compatibility-relevant changes.

=======
TET 5.4
=======

- The following encodings are no longer built-in: iso8859-2 through
  iso8859-10, iso8859-13, iso8859-14, cp1250, cp1251, cp1253, cp1254, cp1255,
  cp1256, cp1257, cp1258. They are supplied as code page files in the TET
  package under "resources/codepage" and can be loaded as encoding resources.
  Note that the removal of the built-in cp* code pages affects only
  non-Windows platforms. On Windows the cp* code pages are still available
  from the host system.
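
The rule above can be sketched as a small lookup. This is only an
illustration in Python: the function name needs_codepage_file() and the
constant REMOVED_BUILTIN are hypothetical helpers and not part of the TET API;
the encoding names are taken from the list above.

```python
# Encodings which are no longer built-in as of TET 5.4 (from the list above).
REMOVED_BUILTIN = (
    {f"iso8859-{n}" for n in range(2, 11)}          # iso8859-2 .. iso8859-10
    | {"iso8859-13", "iso8859-14"}
    | {f"cp{n}" for n in (1250, 1251, 1253, 1254, 1255, 1256, 1257, 1258)}
)


def needs_codepage_file(encoding: str, on_windows: bool) -> bool:
    """Return True if the encoding must be loaded from a code page file."""
    name = encoding.lower()
    if name not in REMOVED_BUILTIN:
        return False
    # On Windows the cp* code pages are still available from the host system.
    if on_windows and name.startswith("cp"):
        return False
    return True
```

For example, needs_codepage_file("cp1250", on_windows=False) is true, while
the same call with on_windows=True is false.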


=======
TET 5.3
=======

- TET_set_option() with option "filenamehandling": the keyword "legacy" is
  no longer supported; use an explicit encoding name or "honorlang" instead.

- pCOS paths containing a # character followed by two hexadecimal characters
  (e.g. #28) behave differently. While previously they were treated as three
  literal characters, the sequence is now unquoted and results in a single
  character. The # character must be quoted as #23 in order to be used
  literally; see the pCOS Path Reference for details.
  In the C language binding the literal sequence "%q" in a pCOS path (in
  addition to "%s" and "%d") also requires the first character to be quoted,
  i.e. "#25q".

- PHP binding
  - Removed the functional interface, which has long been deprecated in
    favor of the object-oriented interface.
  - The delete() method (which was a no-op anyway) is no longer available.

- C binding: the errorhandler callback of TET_new2() no longer uses the
  errortype parameter which has been unused since TET 4. The errortype
  parameter must be removed from all custom error handlers.

- The following deprecated options have been removed: 
  - TET_open_page() and TET_process_page():
    - option "contentanalysis": suboption "ideographic"
    - option "contentanalysis": suboptions "lineseparator", "paraseparator",
      "wordseparator"
      
  - TET_open_page() and TET_process_page(): option "skipengines"
  
  - TET_open_document():
    - option "keeppua"
    - option "tetml", suboption "elements", suboption "docxmp"
    
  - TET_write_image_file() and TET_get_image_data():
    - option "smallimages", suboption "maxcount"

- C++ binding: custom string converters are no longer supported.
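
The changed treatment of # sequences in pCOS paths described above can be
mimicked with a few lines of Python. This is a sketch only: quote_literal()
and unquote() are hypothetical helpers, not TET functions, and accepting both
uppercase and lowercase hexadecimal digits is an assumption.

```python
import re


def quote_literal(name: str) -> str:
    """Quote every '#' as '#23' so that it is taken literally in a pCOS path."""
    return name.replace("#", "#23")


def unquote(path: str) -> str:
    """Mimic the new rule: '#' plus two hex digits yields a single character."""
    return re.sub(
        r"#([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        path,
    )
```

With these helpers unquote("#28") yields the single character "(", and a name
which contains the three literal characters "#28" survives the round trip
when it is passed through quote_literal() first.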


=======================
TET 5.2 (July 19, 2019)
=======================

Deprecated Options
------------------
- The options "compression" and "preferredtiffcompression" of
  TET_write_image_file() and TET_get_image_data() are deprecated.

- Option "imageanalysis" of TET_open_page() and TET_process_page(): the
  suboption "smallimages" is deprecated; use heightrange/sizerange/widthrange.

- The suboption "docxmp" of the suboption "elements" of the document option
  "tetml" has been deprecated since TET 5.0. Use "metadata" instead.


Removal of deprecated API Functions
-----------------------------------
- Removed the REALbasic/Xojo binding.

- Removed the following deprecated API methods from all language bindings:
  TET_utf8_to_utf16(), TET_utf16_to_utf8(),
  TET_utf32_to_utf16(), TET_utf8_to_utf32(),
  TET_utf32_to_utf8(), TET_utf16_to_utf32(),
  TET_get_xml_data()

- Removed the C macro TET_CT_SUR_TRAIL since this value has not been used
  since TET 4.0.

- Removed the option "format" of TET_write_image_file() and TET_get_image_data()
  (it has been a no-op since TET 4.0).
  
- C++ binding: removed the following macro-controlled features for TET 3
  compatibility:
  - TETCPP_TET_WSTRING for disabling wstring support in favor of string.
  - TETCPP_USE_PDFLIB_NAMESPACE for disabling namespace support.


Incompatible Changes
--------------------

- Improved exception handling in the Python binding changed the arguments
  of the TETException object. Previously it contained a single string
  combining the error number, method name and error description. These
  components are now available as separate members of the TETException
  object and can be accessed individually. For example, print(ex) now
  prints the following triple:
  
  "(1400, 'set_option', "Unknown option 'search'")"

  instead of the previous combined message
  "TETlib TET Error [1400] set_option: Unknown option 'search'"

- TET_open_document_callback() did not work correctly with files larger than
  2 GB on Windows and z/OS. Fixing this required a small change in the
  declaration of TET_open_document_callback() and the seekproc callback
  function.
  
- It is now an error if a CMap is required for extracting text from a PDF but
  the corresponding CMap is not found. This is considered a configuration
  error which must be fixed for proper operation.
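
Code which must cope with both forms of the Python exception described above
could normalize them as follows. This is a sketch under assumptions: the class
FakeTETException is only a stand-in so that the example runs without TET, the
helper error_parts() is hypothetical, and it assumes that the new-style
exception carries the triple in its args attribute, as the print(ex) output
above suggests.

```python
import re


class FakeTETException(Exception):
    """Stand-in for TETException; it only carries the exception arguments."""


# Old combined message: "TETlib TET Error [1400] set_option: Unknown option ..."
OLD_FORMAT = re.compile(r"\[(\d+)\]\s+(\w+):\s+(.*)$")


def error_parts(ex: Exception):
    """Return (error number, method name, description) for either form."""
    if len(ex.args) == 3:
        # New form: the three components are separate arguments.
        return tuple(ex.args)
    if not ex.args:
        raise ValueError("exception carries no arguments")
    # Old form: pick the components out of the single combined string.
    match = OLD_FORMAT.search(str(ex.args[0]))
    if match is None:
        raise ValueError("unrecognized error message format")
    return int(match.group(1)), match.group(2), match.group(3)
```

Both the old combined message and the new triple shown above then yield the
same three components.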


======================
TET 5.1 (May 24, 2017)
======================

- TETML may contain the new elements List, Item, Label, and Body. List
  detection is disabled by default, and can be enabled with the page option
  structureanalysis={list=true}. Since these are only extensions, the
  namespace and schema location attributes for the TETML schema remain
  unchanged; only the "version" attribute has been changed to "5.1".

- TET_create_pvf(): the default of the "copy" option has been changed from
  "false" to "true" for all language bindings except C/C++. This temporarily
  requires more memory, but avoids spurious memory problems in situations
  where the language's garbage collector no longer has a reference to the
  memory.

- Removed TET_open_document_mem() from all language bindings since it has
  been deprecated since TET 3.0. Applications which still use this old
  function should switch to TET_create_pvf() and TET_open_document().


===========================
TET 5.0 (November 04, 2015)
===========================

General
-------
- The API function TET_get_xml_data() is deprecated; use TET_get_tetml(),
  which has the same interface and semantics.

- The "skipengines" option of TET_open_page() is deprecated;
  use the "engines" option of TET_open_document().


CJK Text Extraction
-------------------
- The default word splitting behavior for ideographic CJK characters has been
  changed from "split" to "keep". Therefore the suboption "ideographic" of
  the "contentanalysis" page option is no longer required, and has been
  declared as deprecated.
  

Image Extraction
----------------
- TET_write_image_file(): JPEG 2000 images are no longer reported with
  return value 30 and suffix .jpx. Instead, the function distinguishes between
  plain JPEG 2000 (return value 31, suffix .jp2), extended JPEG 2000 (return
  value 32, suffix .jpf), and raw JPEG 2000 code streams (return value 33,
  suffix .j2k). Client code which checks for the image type must be adjusted
  accordingly.

- The unsupported option "format" of TET_write_image_file() and
  TET_get_image_data() is no longer available since the TET kernel needs full
  control over the choice of image output format.

- "smallimages" suboption of the "imageanalysis" page option: the suboption
  "maxcount" is deprecated.

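Client code which checks the image type returned by TET_write_image_file()
could replace a test for the old value 30 with a lookup along the following
lines. This is a Python sketch: the names JPEG2000_SUFFIX and
jpeg2000_suffix() are hypothetical, and only the numeric values and suffixes
are taken from the JPEG 2000 note above.

```python
# JPEG 2000 image types and file name suffixes reported since TET 5.0.
JPEG2000_SUFFIX = {
    31: ".jp2",   # plain JPEG 2000
    32: ".jpf",   # extended JPEG 2000
    33: ".j2k",   # raw JPEG 2000 code stream
}

LEGACY_JPEG2000 = 30   # reported with suffix .jpx before TET 5.0


def jpeg2000_suffix(image_type: int):
    """Return the suffix for a JPEG 2000 image type, or None for other types."""
    if image_type == LEGACY_JPEG2000:
        raise ValueError("image type 30 is no longer reported since TET 5.0")
    return JPEG2000_SUFFIX.get(image_type)
```

A return value of None means that the image is not a JPEG 2000 image and must
be handled by the existing code for other image types.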

TETML Schema
------------
- New namespace URI: TETML output created by TET 5 adheres to the new TETML
  schema TET-5.0.xsd. All XSLT applications for processing TETML 5 must apply
  the following change to switch to the new TETML schema:

  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:tet="http://www.pdflib.com/XML/TET3/TET-3.0">

  must be changed to:

  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:tet="http://www.pdflib.com/XML/TET5/TET-5.0">


- Additional <Box> elements: The following elements now have an additional
  <Box> element as child (like <Word> in TET 4):
  Para, Table

  This requires changes in XSLT stylesheets which select direct children of
  one of the affected elements, e.g. Words within Para.

  Example of a TET 4 XSLT fragment:

  <xsl:template match="tet:Para">
      <xsl:for-each select="tet:Word">
          <!-- do something with tet:Word element -->
      </xsl:for-each>
  </xsl:template>

  XSLT code updated for TET 5:

  <xsl:template match="tet:Para">
      <xsl:for-each select="tet:Box/tet:Word">
          <!-- do something with tet:Word element -->
      </xsl:for-each>
  </xsl:template>

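The same two changes (new namespace URI, additional Box level between Para
and Word) also affect code which processes TETML without XSLT. The following
Python sketch uses the standard xml.etree.ElementTree module. The SAMPLE
fragment is a hypothetical, heavily reduced stand-in for real TETML output,
and words_in_paras() is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the TETML 5 schema (see above).
NS = {"tet": "http://www.pdflib.com/XML/TET5/TET-5.0"}

# Hypothetical minimal fragment; real TETML contains many more elements.
SAMPLE = """<TET xmlns="http://www.pdflib.com/XML/TET5/TET-5.0">
  <Para>
    <Box>
      <Word><Text>Hello</Text></Word>
      <Word><Text>world</Text></Word>
    </Box>
  </Para>
</TET>"""


def words_in_paras(tetml: str):
    """Collect the text of all Word elements below Para/Box."""
    root = ET.fromstring(tetml)
    words = []
    for para in root.iterfind(".//tet:Para", NS):
        # TET 4 style "tet:Word" would find nothing here; go through tet:Box.
        for word in para.iterfind("tet:Box/tet:Word", NS):
            words.append(word.findtext("tet:Text", default="", namespaces=NS))
    return words
```

Selecting "tet:Word" directly below Para, as in the TET 4 fragment, returns
an empty result for this sample since the Word elements are now children of
Box.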


==========================
TET 4.4 (January 27, 2015)
==========================

- Removed the deprecated suboption "version" of the document option "tetml". 


======================
TET 4.3 (May 26, 2014)
======================

(No compatibility notes)


======================
TET 4.2 (May 10, 2013)
======================

- Maintenance releases require a suitable license key which is available only
  for customers with active support.


===========================
TET 4.1 (February 20, 2012)
===========================

- Perl binding: if an API function returns UTF-8 (which is the default for
  TET_get_text()) the returned Perl string will now be flagged as UTF-8.
  As a result, Perl functions (e.g. length()) count the Unicode characters
  in the string instead of the number of bytes.
  If you get a warning such as the following when writing to a file
  
  "Wide character in print at extractor.pl line 76."
  
  you must tell Perl that the output file contains UTF-8 as follows:
  
  binmode(OUTFP, ":utf8");
  
  (see http://perldoc.perl.org/functions/binmode.html for details).

- PHP binding: the name of the TET extension for PHP changed from
  libtet_php.(so|dll|sl) to php_tet.(so|dll|sl).

- The following functions are deprecated:
  TET_utf8_to_utf16(), TET_utf16_to_utf8(),
  TET_utf32_to_utf16(), TET_utf8_to_utf32(),
  TET_utf32_to_utf8(), TET_utf16_to_utf32()
  Use TET_convert_to_unicode() instead.
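
The difference between counting characters and counting bytes which is
mentioned in the Perl note above is not specific to Perl. The following
Python sketch illustrates it; the helper names are hypothetical, and passing
the encoding to open() plays a role comparable to binmode() in Perl.

```python
def char_and_byte_counts(text: str):
    """Return (number of Unicode characters, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


def write_utf8(path: str, text: str) -> None:
    """Write text to a file which is explicitly declared as UTF-8."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(text)
```

For the string "Größe" the helper reports 5 characters but 7 bytes since
"ö" and "ß" each require two bytes in UTF-8.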


=======================
TET 4.0 (July 27, 2010)
=======================

- TET_open_page() and TET_process_page(): the following suboptions for the
  contentanalysis option are deprecated:
  lineseparator, paraseparator, wordseparator
  
  Use the corresponding option in TET_open_document() instead.

- TET_open_document(): the option "keeppua" is deprecated; use one of the
  following instead:

  fold={{[:Private_Use:] preserve}} or
  fold={{[:Private_Use:] unknownchar}}
  
- TET_get_char_info():
  There is no longer any fixed relationship between glyphs (as represented
  by the TET_char_info structure) and characters in the Unicode text
  returned by TET_get_text(). Instead, the set of glyphs for a text chunk
  as a whole is known to generate the sequence of Unicode characters
  comprising the chunk.

- type member in the TET_char_info structure: type=11 (trailing value of
  a surrogate pair) is no longer used since there is no longer any 1:1
  relationship between Unicode values and TET_char_info structures.

- TET_open_document_*():
  the "version" suboption of the "tetml" option is deprecated.


===========================
TET 3.0 (February 02, 2009)
===========================

- The "zoneseparator" suboption of the "contentanalysis" option of
  TET_open_page() is no longer supported.

- TET_open_document_mem() is deprecated; use PVF and TET_open_document().

- TET 2 XML output has been replaced by a more powerful grammar which is
  described by a suitable schema. The old XML grammar can be enabled with
  the option "tetml={version=2}" in TET_open_document().


==========================
TET 2.2 (January 24, 2007)
==========================

- Switched to the new license scheme and keys which were introduced with
  PDFlib 7.0.0.
  
  
=============================
TET 2.1.0 (December 12, 2005)
=============================

- Option "outputformat" in TET_set_option(): changed the default value on
  zSeries from "utf8" to "ebcdicutf8" (the default on all other systems
  remains "utf8").
  In order to restore the previous behavior, issue the following call:
  TET_set_option(p, "outputformat", "utf8");
