Ubuntu Manpage: unicharambigs - Tesseract unicharset ambiguities

Provided by: tesseract-ocr_4.00~git2288-10f4998a-2_amd64

NAME

       unicharambigs - Tesseract unicharset ambiguities

DESCRIPTION

       The unicharambigs file (a component of traineddata, see combine_tessdata(1)) is used by Tesseract to
       represent possible ambiguities between characters or groups of characters.

       The file contains a number of lines, laid out as follows:

           [num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]

       Field one     the number of characters contained in
                     field two

       Field two     the character sequence to be replaced

       Field three   the number of characters contained in
                     field four

       Field four    the character sequence used to
                     replace field two

       Field five    contains either 1 or 0. 1 denotes a
                     mandatory replacement, 0 denotes an
                     optional replacement.

       Characters appearing in fields two and four should appear in unicharset. The numbers in fields one and
       three refer to the number of unichars (not bytes).
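
       The five-field layout above can be read with a few lines of code. The following is a minimal
       sketch in Python, not part of Tesseract: the names AmbigRule and parse_ambig_line are hypothetical,
       and it assumes that the unichars inside fields two and four are separated by single spaces, as
       the EXAMPLE section suggests.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AmbigRule:
    """One line of a unicharambigs file (hypothetical helper type)."""
    source: List[str]   # field two: unichars to be replaced
    target: List[str]   # field four: replacement unichars
    mandatory: bool     # field five: True for 1, False for 0


def parse_ambig_line(line: str) -> AmbigRule:
    """Parse one tab-separated line and check the declared counts."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    n_source, source_text, n_target, target_text, flag = fields

    # Assumption: unichars within a field are separated by spaces.
    source = source_text.split()
    target = target_text.split()

    # Fields one and three count unichars, not bytes.
    if int(n_source) != len(source):
        raise ValueError("field one does not match the unichars in field two")
    if int(n_target) != len(target):
        raise ValueError("field three does not match the unichars in field four")
    if flag not in ("0", "1"):
        raise ValueError("field five must be 0 or 1")

    return AmbigRule(source=source, target=target, mandatory=(flag == "1"))
```

       The sketch does not check that each unichar is present in unicharset; a fuller validator would
       need that file as well.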

EXAMPLE

           2       ' '     1       "     1
           1       m       2       r n   0
           3       i i i   1       m     0

       In this example, all instances of the 2 character sequence '' will always be replaced by the 1 character
       sequence "; a 1 character sequence m may be replaced by the 2 character sequence rn, and the 3 character
       sequence iii may be replaced by the 1 character sequence m.
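
       Going the other way, the three example lines can be produced from pairs of unichar sequences, with
       fields one and three derived from the sequence lengths. This is an illustrative sketch only: the
       function format_ambig_line and the list EXAMPLE_RULES are hypothetical names, and joining unichars
       with a single space is an assumption based on the example above.

```python
from typing import List, Sequence, Tuple


def format_ambig_line(source: Sequence[str], target: Sequence[str],
                      mandatory: bool) -> str:
    """Build one tab-separated line; counts are numbers of unichars."""
    fields = [
        str(len(source)),           # field one
        " ".join(source),           # field two (assumed space-separated)
        str(len(target)),           # field three
        " ".join(target),           # field four (assumed space-separated)
        "1" if mandatory else "0",  # field five
    ]
    return "\t".join(fields)


# The three rules from the example: (to be replaced, replacement, mandatory).
EXAMPLE_RULES: List[Tuple[List[str], List[str], bool]] = [
    (["'", "'"], ['"'], True),
    (["m"], ["r", "n"], False),
    (["i", "i", "i"], ["m"], False),
]

example_lines = [format_ambig_line(s, t, m) for s, t, m in EXAMPLE_RULES]
for line in example_lines:
    print(line)
```

       Each unichar is passed as its own list item, so the counts reflect unichars rather than bytes
       even when a unichar is encoded as several bytes.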

HISTORY

       The unicharambigs file first appeared in Tesseract 3.00. Prior to that, a similar format called
       DangAmbigs (dangerous ambiguities) was used; it was almost identical, except that only mandatory
       replacements could be specified and field five was absent.

BUGS

       This is a documentation "bug": it’s not currently clear what should be done in the case of ligatures
       (such as fi) which may also appear as regular letters in the unicharset.

AUTHOR

       The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995)
       and Google (2006-present).

                                                   04/07/2018                                   UNICHARAMBIGS(5)
