Ubuntu Manpage: unicharambigs - Tesseract unicharset ambiguities

Provided by: tesseract-ocr_5.1.0-2_amd64

NAME

       unicharambigs - Tesseract unicharset ambiguities

DESCRIPTION

       The unicharambigs file (a component of traineddata, see combine_tessdata(1) ) is used by
       Tesseract to represent possible ambiguities between characters, or groups of characters.

       The file contains a number of lines, laid out as follow:

           [num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]

       Field one     the number of characters
                     contained in field two

       Field two     the character sequence to be
                     replaced

       Field three   the number of characters
                     contained in field four

       Field four    the character sequence used to
                     replace field two

       Field five    contains either 1 or 0. 1
                     denotes a mandatory replacement,
                     0 denotes an optional
                     replacement.

       Characters appearing in fields two and four should appear in unicharset. The numbers in
       fields one and three refer to the number of unichars (not bytes).

EXAMPLE

           v1
           2       ' '     1       "     1
           1       m       2       r n   0
           3       i i i   1       m     0

       The first line is a version identifier. In this example, all instances of the 2 character
       sequence '' will always be replaced by the 1 character sequence "; a 1 character sequence
       m may be replaced by the 2 character sequence rn, and the 3 character sequence may be
       replaced by the 1 character sequence m.

       Version 3.03 and on supports a new, simpler format for the unicharambigs file:

           v2
           '' " 1
           m rn 0
           iii m 0

       In this format, the "error" and "correction" are simple UTF-8 strings separated by a
       space, and, after another space, the same type specifier as v1 (0 for optional and 1 for
       mandatory substitution). Note the downside of this simpler format is that Tesseract has to
       encode the UTF-8 strings into the components of the unicharset. In complex scripts, this
       encoding may be ambiguous. In this case, the encoding is chosen such as to use the least
       UTF-8 characters for each component, ie the shortest unicharset components will make up
       the encoding.

HISTORY

       The unicharambigs file first appeared in Tesseract 3.00; prior to that, a similar format,
       called DangAmbigs (dangerous ambiguities) was used: the format was almost identical,
       except only mandatory replacements could be specified, and field 5 was absent.

BUGS

       This is a documentation "bug": it’s not currently clear what should be done in the case of
       ligatures (such as fi) which may also appear as regular letters in the unicharset.

AUTHOR

       The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett
       Packard (1985-1995) and Google (2006-present).

                                            06/02/2022                           UNICHARAMBIGS(5)

NAME

DESCRIPTION

EXAMPLE

HISTORY

BUGS

SEE ALSO

AUTHOR