
PDFlib TET 5 - Unique Advantages for Text Extraction

 

Dehyphenation

Text with hyphenation and dash character

TET detects hyphenated words which span multiple lines, removes the hyphen, and combines the individual parts to form a complete word. This is important to ensure that searches for the full word are successful although hyphenated parts are present in the document. Dashes (different from hyphens) are treated separately since they must not be removed.

 

TET correctly removes the hyphen, but keeps the dash
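The dehyphenation idea described above can be illustrated with a minimal sketch. This is not TET's implementation: the function name `dehyphenate`, the `HYPHENS` set, and the rule "a line-end hyphen after a letter, followed by a lowercase letter, marks a split word" are assumptions made for illustration only. A simple rule like this would also merge genuinely hyphenated compounds that happen to break at the hyphen.

```python
# Characters treated as hyphens (assumption for this sketch):
# hyphen-minus, soft hyphen, and the Unicode hyphen.
# Dashes such as U+2013 (en dash) are deliberately not listed, so they survive.
HYPHENS = {"-", "\u00ad", "\u2010"}


def dehyphenate(lines):
    """Join text lines into one string, merging words split by a line-end hyphen."""
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        split_word = (
            len(out) >= 2
            and out[-1] in HYPHENS
            and out[-2].isalpha()
            and line[0].islower()
        )
        if split_word:
            # Drop the hyphen and glue the two word parts together.
            out = out[:-1] + line
        elif out:
            out += " " + line
        else:
            out = line
    return out


print(dehyphenate(["The extrac-", "tion works \u2013", "mostly"]))
```

The hyphen in »extrac-« is removed, while the dash at the end of the second line is kept because it is not in the hyphen set.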

 
 

Shadow and Artificial Bold Text Detection

Text with shadow effect

Digital documents often contain shadowed text where the shadow effect is achieved by placing the same text on the page several times, with a small offset between the instances. Similarly, bold text is often simulated by overprinting the same text. As a result, the document contains the characters of the shadowed or bold word more than once. TET’s patented shadow detection algorithm identifies and removes the redundant instances to avoid extracting excess text. Other software extracts the shadowed or bold text several times, while TET keeps a single copy. Extra instances of a whole word still produce a search engine hit, but a word that is duplicated character by character, as in the example, can no longer be found.

Other products extract »Inttrroduccttiion«

TET extracts »Introduction«
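The following sketch shows one simple way to think about removing redundant glyphs. It is not TET's patented algorithm. The `Glyph` record, the function `remove_shadow_duplicates`, and the fixed `tolerance` value are hypothetical. The sketch assumes each glyph is available with its character and page position, and treats a glyph as redundant when the same character was already seen at almost the same position.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    char: str
    x: float  # horizontal position on the page
    y: float  # vertical position on the page


def remove_shadow_duplicates(glyphs, tolerance=1.5):
    """Keep only the first of several identical glyphs placed almost on top of each other."""
    kept = []
    for g in glyphs:
        is_duplicate = any(
            k.char == g.char
            and abs(k.x - g.x) <= tolerance
            and abs(k.y - g.y) <= tolerance
            for k in kept
        )
        if not is_duplicate:
            kept.append(g)
    return kept


# "In" drawn twice with a small offset to simulate a shadow.
page = [
    Glyph("I", 10.0, 100.0), Glyph("I", 10.8, 99.2),
    Glyph("n", 14.0, 100.0), Glyph("n", 14.8, 99.2),
]
print("".join(g.char for g in remove_shadow_duplicates(page)))
```

Legitimate double letters are unaffected in this sketch because they sit a full character width apart, which is larger than the tolerance.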

 
 

Accented Characters

Text with diacritical marks

In many languages, accents and other diacritical marks are placed close to other characters to form combined characters. Some typesetting programs, notably TeX, emit two separate characters (base character and accent) to create a combined character. For example, to create the character ä, first the letter a is placed on the page, and then the dieresis character ¨ is placed on top of it. TET detects this situation and combines both characters to form the appropriate composite character.

Other products extract »Midi-Pyr´en´ees«

TET extracts »Midi-Pyrénées«
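As a rough illustration of combining a separate accent with its base letter, the sketch below works on an already extracted string. It is not how TET works internally. It assumes the accent appears immediately before its base letter in the extracted text, as in the »Pyr´en´ees« example, whereas a real extractor would decide based on glyph positions. The mapping table and the function name `compose_accents` are hypothetical. The composition itself relies on Unicode normalization form NFC from Python's standard `unicodedata` module.

```python
import unicodedata

# Spacing accent -> combining mark (small illustrative subset).
SPACING_TO_COMBINING = {
    "\u00b4": "\u0301",  # acute accent
    "\u0060": "\u0300",  # grave accent
    "\u00a8": "\u0308",  # dieresis
    "\u005e": "\u0302",  # circumflex
}


def compose_accents(text):
    """Merge 'accent + letter' pairs into single precomposed characters."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in SPACING_TO_COMBINING and nxt.isalpha():
            # Reorder to base letter + combining mark so NFC can compose them.
            out.append(nxt + SPACING_TO_COMBINING[ch])
            i += 2
        else:
            out.append(ch)
            i += 1
    return unicodedata.normalize("NFC", "".join(out))


print(compose_accents("Midi-Pyr\u00b4en\u00b4ees"))
```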

 
 

Ligatures

Text with T+h and f+i ligatures

Ligatures combine two or more characters in a single glyph. The most common ligatures are those for the combinations fi, fl, and ffi; less common ligatures are used for Th, sp, ct, st, and many others. When extracting text from digital documents, ligatures must be analyzed and separated into their constituent characters to allow proper text processing. TET detects ligatures and delivers two or more characters as appropriate. TET can optionally preserve ligatures if required.

Other products extract » e rst photographs«

TET extracts »The first photographs«
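For ligatures that have their own Unicode code points, the separation step can be sketched with Unicode compatibility decomposition. This is only an illustration, not TET's method: the function name `expand_ligatures` is hypothetical, and the sketch covers just the Latin ligatures in the range U+FB00 to U+FB06 (ff, fi, fl, ffi, ffl, and the two st forms). Ligatures such as Th have no Unicode code point, so handling them requires information from the font, which this sketch does not attempt.

```python
import unicodedata


def expand_ligatures(text):
    """Replace Latin ligature characters (U+FB00..U+FB06) with their constituent letters."""
    return "".join(
        unicodedata.normalize("NFKD", ch) if "\ufb00" <= ch <= "\ufb06" else ch
        for ch in text
    )


print(expand_ligatures("\ufb01rst"))    # the fi ligature followed by "rst"
print(expand_ligatures("o\ufb03ce"))    # the ffi ligature inside a word
```

Restricting the decomposition to the ligature range leaves all other characters, including accented letters, untouched.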

 
 

Drop Caps

Text with drop cap character

Drop caps are large initial characters at the beginning of a paragraph where the top of the initial aligns with the top of the line, and the remainder of the character drops down several lines. Drop caps are used to emphasize the start of a paragraph. If they are not treated properly, the initial word is extracted in two parts: the single initial character and the remainder of the word.

Other products extract two words: the drop cap »S« and »tellen«.

TET correctly extracts the single word »Stellen«.
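A simplified way to picture drop cap handling is to look at font sizes. The sketch below is an assumption-laden illustration, not TET's approach: it presumes the text arrives as `(text, font_size)` fragments, and the function name `merge_drop_caps` and the size `ratio` threshold are hypothetical.

```python
def merge_drop_caps(fragments, ratio=2.0):
    """Merge a single oversized capital with the lowercase fragment that follows it.

    fragments: list of (text, font_size) tuples in reading order.
    """
    merged = []
    i = 0
    while i < len(fragments):
        text, size = fragments[i]
        has_next = i + 1 < len(fragments)
        if (
            has_next
            and len(text) == 1
            and text.isupper()
            and size >= ratio * fragments[i + 1][1]
            and fragments[i + 1][0][:1].islower()
        ):
            next_text, next_size = fragments[i + 1]
            # The merged word takes the body text size.
            merged.append((text + next_text, next_size))
            i += 2
        else:
            merged.append((text, size))
            i += 1
    return merged


print(merge_drop_caps([("S", 36.0), ("tellen", 10.0), ("Sie", 10.0)]))
```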

 
 

Unicode Mapping

Text without Unicode mapping results in garbage characters

Unicode mapping forms the foundation of PDF text extraction: every glyph on the page must be assigned the corresponding Unicode value. PDF complicates this task by supporting a variety of font and encoding variants which may or may not provide the information required to assign proper Unicode values. In the worst case, the document does not provide enough information, so no usable text can be extracted from it.

TET’s patented Unicode mapping algorithm works as a cascade that takes all available pieces of information into account in order to determine Unicode values. For many problematic documents TET extracts proper Unicode text where other products deliver only unusable garbage.

Other products extract unusable garbage, while TET delivers text.
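The general idea of a cascade, trying one source of information after another until a Unicode value is found, can be sketched as follows. The concrete steps are assumptions chosen for illustration and do not describe TET's patented algorithm: a code-to-Unicode table embedded in the document, then a lookup of the glyph name in a small name table, then glyph names of the form `uniXXXX`. The function `map_glyph` and the table `GLYPH_NAME_TO_TEXT` are hypothetical.

```python
# Tiny illustrative excerpt; a real glyph name table has thousands of entries.
GLYPH_NAME_TO_TEXT = {"adieresis": "\u00e4", "eacute": "\u00e9", "space": " "}


def map_glyph(code, to_unicode, code_to_name):
    """Return the Unicode text for a glyph code, or None if every step fails.

    to_unicode:   dict mapping glyph codes to text, taken from the document if present
    code_to_name: dict mapping glyph codes to glyph names from the font
    """
    # Step 1: an explicit mapping supplied by the document wins.
    if code in to_unicode:
        return to_unicode[code]

    # Step 2: fall back to the glyph name, if it is a known name.
    name = code_to_name.get(code)
    if name in GLYPH_NAME_TO_TEXT:
        return GLYPH_NAME_TO_TEXT[name]

    # Step 3: names such as "uni00E4" carry the code point directly.
    if name and name.startswith("uni") and len(name) == 7:
        try:
            return chr(int(name[3:], 16))
        except ValueError:
            pass

    # No source of information was sufficient.
    return None


print(map_glyph(2, {}, {2: "eacute"}))
```

Ordering the steps from most to least reliable means a weaker heuristic is consulted only when a stronger source is missing.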

 
 

Bidirectional Text with Arabic and Hebrew

Bidirectional Hebrew and English text

PDF does not encode logical text, but is simply a container for glyphs on the page. Text in the Arabic and Hebrew scripts runs from right to left. Since it often contains left-to-right inserts such as numbers or names in Western languages, text must be interpreted in both directions - hence the term »bidirectional«. Arabic poses additional challenges since the characters are used in up to four different contextual forms. These shaped forms of characters must be normalized to the corresponding standard (isolated) form.

TET reorders the visual mixture of right-to-left and left-to-right text to create proper logical text output.
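The normalization of shaped Arabic forms mentioned above can be illustrated with Unicode compatibility normalization, since the Arabic Presentation Forms blocks map back to the standard letters. This sketch is not TET's implementation, and the function name `normalize_arabic_forms` is hypothetical. It covers only the shape normalization. Reordering visual text into logical order is a separate and considerably more involved step that is not shown here.

```python
import unicodedata


def normalize_arabic_forms(text):
    """Map Arabic presentation forms (contextual shapes) to the standard letters."""
    return "".join(
        unicodedata.normalize("NFKC", ch)
        if "\ufb50" <= ch <= "\ufdff" or "\ufe70" <= ch <= "\ufeff"
        else ch
        for ch in text
    )


# Initial, medial, and final forms of the letter beh all map to U+0628.
shaped = "\ufe91\ufe92\ufe90"
print([hex(ord(c)) for c in normalize_arabic_forms(shaped)])
```

Limiting the normalization to the two presentation form blocks avoids altering unrelated characters elsewhere in the text.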

 
 

Damaged PDF Documents

Screenshot: Acrobat error message for damaged document

PDF documents may get damaged because of transmission errors or other problems. TET’s repair mode recovers many kinds of damaged PDFs. Sometimes PDF documents are damaged so heavily that the pages cannot even be displayed in Acrobat. Even in such extreme cases TET often delivers the page contents of the document.

The page contents are not even displayed in Acrobat, but TET still correctly extracts the text.