Provided by: uni2ascii_4.18-5_amd64

NAME

       uni2ascii - convert UTF-8 Unicode to various 7-bit ASCII representations

SYNOPSIS

       uni2ascii [options] (<input file name>)

DESCRIPTION

       uni2ascii  converts  UTF-8  Unicode  to  various  7-bit ASCII representations. If no format is specified,
       standard hexadecimal format (e.g. 0x00E9) is used.  It reads from the standard input and  writes  to  the
       standard output.

       Command line options are:

       -A     List the single character approximations carried out by the -y flag.

       -a <format>
              Convert  to  the  specified  format.  Formats may be specified by means of the following arbitrary
              single character codes, by means of names such as "SGML_decimal", and by examples of  the  desired
              format.

              A Generate hexadecimal numbers with prefix U in angle-brackets (<U00E9>).

              B Generate \x-escaped hex (e.g. \x00E9)

              C Generate \x escaped hexadecimal numbers in braces (e.g. \x{00E9}).

              D Generate decimal HTML numeric character references (e.g. &#0233;)

              E Generate hexadecimal with prefix U (U00E9).

              F Generate hexadecimal with prefix u (u00E9).

              G Generate hexadecimal in single quotes with prefix X (e.g. X'00E9').

              H Generate hexadecimal HTML numeric character references (e.g. &#x00E9;)

              I Generate hexadecimal UTF-8 with each byte's hex preceded by an =-sign (e.g. =C3=A9). This is
              the Quoted Printable format defined by RFC 2045.

              J Generate hexadecimal UTF-8 with each byte's hex preceded by a %-sign (e.g.  %C3%A9). This is the
              URI escape format defined by RFC 2396.

              K Generate octal UTF-8 with each byte escaped by a backslash (e.g.  \303\251)

              L Generate \U-escaped hex outside the BMP, \u-escaped hex within the BMP (U+0000-U+FFFF).

              M Generate hexadecimal SGML numeric character references (e.g. \#xE9;)

              N Generate decimal SGML numeric character references (e.g. \#233;)

              O Generate octal escapes for the three low bytes in big-endian order (e.g. \000\000\351)

              P Generate hexadecimal numbers with prefix U+ (e.g. U+00E9)

              Q  Generate  character  entities  (e.g.  &eacute;)  where  possible, otherwise hexadecimal numeric
              character references.

              R Generate raw hexadecimal numbers (e.g. 00E9)

              S Generate hexadecimal escapes for the three low bytes in big-endian order (e.g. \x00\x00\xE9)

              T Generate decimal escapes for the three low bytes in big-endian order (e.g. \d000\d000\d233)

              U Generate \u-escaped hexadecimal numbers (e.g. \u00E9).

              V Generate \u-escaped decimal numbers (e.g. \u00233).

              X Generate standard hexadecimal numbers (e.g. 0x00E9).

              0 Generate hexadecimal UTF-8 with each byte's hex enclosed within angle brackets (e.g. <C3><A9>).

              1 Generate Common Lisp format hexadecimal numbers (e.g. #x00E9).

              2 Generate Perl format decimal numbers with prefix v (e.g. v233).

              3 Generate hexadecimal numbers with prefix $ (e.g. $00E9).

              4 Generate PostScript format hexadecimal numbers with prefix 16# (e.g. 16#00E9).

              5 Generate Common Lisp format hexadecimal numbers with prefix #16r (e.g. #16r00E9).

              6 Generate Ada format hexadecimal numbers with prefix 16# and suffix # (e.g. 16#00E9#).

              7 Generate Apache log format hexadecimal UTF-8 with each byte's  hex  preceded  by  a  backslash-x
              (e.g.  \xC3\xA9).

              8 Generate Microsoft OOXML format hexadecimal numbers with prefix _x and suffix _ (e.g. _x00E9_).

              9 Generate %\u-escaped hexadecimal numbers (e.g. %\u00E9).
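
              As an illustration of how several of these formats relate to one another, the following Python
              sketch derives a few of them for a single character. It only mimics the formats as described
              above; it is not uni2ascii's implementation, and the function name code_point_formats is a
              hypothetical one chosen for this example.

```python
# Illustrative only: mimics a few of the -a output formats described above.
# This is not how uni2ascii itself is implemented.

def code_point_formats(ch: str) -> dict:
    """Return several 7-bit ASCII representations of one character."""
    cp = ord(ch)                 # the Unicode code point
    utf8 = ch.encode("utf-8")    # the UTF-8 byte sequence
    return {
        "X": "0x%04X" % cp,                        # standard hexadecimal
        "U": "\\u%04X" % cp,                       # \u-escaped hexadecimal
        "P": "U+%04X" % cp,                        # hexadecimal with prefix U+
        "D": "&#%04d;" % cp,                       # decimal HTML reference
        "H": "&#x%04X;" % cp,                      # hexadecimal HTML reference
        "I": "".join("=%02X" % b for b in utf8),   # Quoted Printable style
        "J": "".join("%%%02X" % b for b in utf8),  # URI escape style
        "K": "".join("\\%03o" % b for b in utf8),  # octal UTF-8 bytes
    }


if __name__ == "__main__":
    for code, text in code_point_formats("\u00e9").items():
        print(code, text)
```

              Formats X, U, P, D, and H are computed from the code point, whereas I, J, and K are computed
              from the UTF-8 bytes, which is why the latter show C3 A9 (octal 303 251) rather than E9.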

       -B     Transform to ASCII if possible. This option is equivalent to combining the options -c, -d, -e, -f,
              and -x.

       -c     Convert circled and parenthesized characters to their unenclosed counterparts.

       -d     Strip  diacritics.  This converts single codepoints representing characters with diacritics to the
              corresponding ASCII character and deletes separately encoded diacritics.
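
              The general idea of diacritic stripping can be sketched in Python with canonical decomposition.
              This is an approximation under the assumption that the character has a Unicode decomposition;
              uni2ascii presumably uses its own tables, whose coverage may differ. The function name
              strip_diacritics is hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks.

    Precomposed characters such as U+00E9 become a base letter plus a
    combining mark; separately encoded marks are already combining
    characters. Either way the marks are removed.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


if __name__ == "__main__":
    print(strip_diacritics("\u00c5ngstr\u00f6m"))
```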

       -e     Convert characters to their approximate ASCII equivalents, as follows:
              U+0085  next line                                   0x0A  newline
              U+00A0  no break space                              0x20  space
              U+00AB  left-pointing double angle quotation mark   0x22  double quote
              U+00AD  soft hyphen                                 0x2D  minus
              U+00AF  macron                                      0x2D  minus
              U+00B7  middle dot                                  0x2E  period
              U+00BB  right-pointing double angle quotation mark  0x22  double quote
              U+1361  ethiopic word space                         0x20  space
              U+1680  ogham space                                 0x20  space
              U+2000  en quad                                     0x20  space
              U+2001  em quad                                     0x20  space
              U+2002  en space                                    0x20  space
              U+2003  em space                                    0x20  space
              U+2004  three-per-em space                          0x20  space
              U+2005  four-per-em space                           0x20  space
              U+2006  six-per-em space                            0x20  space
              U+2007  figure space                                0x20  space
              U+2008  punctuation space                           0x20  space
              U+2009  thin space                                  0x20  space
              U+200A  hair space                                  0x20  space
              U+200B  zero-width space                            0x20  space
              U+2010  hyphen                                      0x2D  minus
              U+2011  non-breaking hyphen                         0x2D  minus
              U+2012  figure dash                                 0x2D  minus
              U+2013  en dash                                     0x2D  minus
              U+2014  em dash                                     0x2D  minus
              U+2018  left single quotation mark                  0x60  left single quote
              U+2019  right single quotation mark                 0x27  right or neutral single quote
              U+201A  single low-9 quotation mark                 0x60  left single quote
              U+201B  single high-reversed-9 quotation mark       0x60  left single quote
              U+201C  left double quotation mark                  0x22  double quote
              U+201D  right double quotation mark                 0x22  double quote
              U+201E  double low-9 quotation mark                 0x22  double quote
              U+201F  double high-reversed-9 quotation mark       0x22  double quote
              U+2022  bullet                                      0x6F  small letter o
              U+2028  line separator                              0x0A  newline
              U+2033  double prime                                0x22  double quote
              U+2039  single left-pointing angle quotation mark   0x60  left single quote
              U+203A  single right-pointing angle quotation mark  0x27  right or neutral single quote
              U+204E  low asterisk                                0x2A  asterisk
              U+2212  minus sign                                  0x2D  minus
              U+2216  set minus                                   0x5C  backslash
              U+2217  asterisk operator                           0x2A  asterisk
              U+2223  divides                                     0x7C  vertical line
              U+2500  box drawing light horizontal                0x2D  minus
              U+2501  box drawing heavy horizontal                0x2D  minus
              U+2502  box drawing light vertical                  0x7C  vertical line
              U+2503  box drawing heavy vertical                  0x7C  vertical line
              U+2731  heavy asterisk                              0x2A  asterisk
              U+275D  heavy double turned comma quotation mark    0x22  double quote
              U+275E  heavy double comma quotation mark           0x22  double quote
              U+3000  ideographic space                           0x20  space
              U+FE60  small ampersand                             0x26  ampersand
              U+FE61  small asterisk                              0x2A  asterisk
              U+FE62  small plus sign                             0x2B  plus sign

       -E     List the expansions performed by the -x flag.

       -f     Convert stylistic variants  to  plain  ASCII.   Stylistic  equivalents  include:  superscript  and
              subscript  forms,  small  capitals  (e.g.  U+1D04), script forms (e.g. U+212C), black letter forms
              (e.g. U+212D), fullwidth forms (e.g. U+FF01), halfwidth forms (e.g. U+FF7B), and the  mathematical
              alphanumeric symbols (e.g. U+1D400).

       -h     Help. Print the usage message and exit.

       -l     Use lowercase a-f when generating hexadecimal numbers.

       -n     Convert newlines too. By default, they are left alone.

       -P     Pass  through Unicode rather than converting to ASCII escapes if the character is not converted to
              an ASCII character by a transformation such as diacritic stripping. Note that if  this  option  is
              used the output may not be pure ASCII.

       -p     Pure. Convert characters within the ASCII range, except for space and newline, as well as those
              above it.

       -q     Quiet. Do not chat unnecessarily while working.

       -s     Convert space characters too. By default, they are left alone.

       -S <Unicode:ASCII>
              Define a custom substitution. The argument should consist of the Unicode codepoint to be  replaced
              followed by the ASCII code of the character to be used as replacement, separated by a colon. If no
              ASCII code follows the colon, the specified Unicode character will be deleted.   The  code  values
              may  be  in hexadecimal, octal, or decimal following the usual conventions (to be precise, those of
              strtoul(3)).   This  option  may  be  repeated  as  many  times  as  desired  to  define  multiple
              substitutions.
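
              To make the argument syntax concrete, here is a Python sketch of how a Unicode:ASCII
              specification could be interpreted. The helper names are hypothetical, and the number parsing
              only imitates the strtoul(3) prefix conventions (0x for hexadecimal, a leading 0 for octal);
              unlike strtoul(3), it raises an error on malformed input.

```python
def parse_c_integer(text: str) -> int:
    """Parse an integer using the strtoul(3) base-0 prefix conventions."""
    t = text.strip()
    if t.lower().startswith("0x"):
        return int(t, 16)          # hexadecimal, e.g. 0x2022
    if len(t) > 1 and t.startswith("0"):
        return int(t, 8)           # octal, e.g. 052
    return int(t, 10)              # decimal, e.g. 173


def parse_substitution(spec: str):
    """Split 'Unicode:ASCII' into (code point, replacement or None)."""
    source, _, target = spec.partition(":")
    replacement = chr(parse_c_integer(target)) if target else None
    return parse_c_integer(source), replacement


def apply_substitutions(text: str, specs) -> str:
    """Apply every substitution; a missing ASCII code deletes the character."""
    table = {}
    for spec in specs:
        code_point, replacement = parse_substitution(spec)
        table[code_point] = replacement  # None means delete in str.translate
    return text.translate(table)


if __name__ == "__main__":
    # Replace the bullet U+2022 with '*' and delete the soft hyphen U+00AD.
    print(apply_substitutions("a\u2022b\u00adc", ["0x2022:052", "173:"]))
```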

       -v     Print program version information and exit.

       -w     Add a space after each converted item.

       -x     Expand  certain  characters  to multicharacter sequences.  The characters affected are the same as
              those affected by the -y option.
              U+00A2 CENT SIGN                        -> cent
              U+00A3 POUND SIGN                       -> pound
              U+00A5 YEN SIGN                         -> yen
              U+00A9 COPYRIGHT SYMBOL                 -> (c)
              U+00AE REGISTERED SYMBOL                -> (R)
              U+00BC ONE QUARTER                      -> 1/4
              U+00BD ONE HALF                         -> 1/2
              U+00BE THREE QUARTERS                   -> 3/4
              U+00C6 CAPITAL LETTER ASH               -> AE
              U+00DF SMALL LETTER SHARP S             -> ss
              U+00E6 SMALL LETTER ASH                 -> ae
              U+0132 LIGATURE IJ                      -> IJ
              U+0133 LIGATURE ij                      -> ij
              U+0152 LIGATURE OE                      -> OE
              U+0153 LIGATURE oe                      -> oe
              U+01F1 CAPITAL LETTER DZ                -> DZ
              U+01F2 MIXED LETTER Dz                  -> Dz
              U+01F3 SMALL LETTER DZ                  -> dz
              U+02A6 SMALL LETTER TS DIGRAPH          -> ts
              U+2026 HORIZONTAL ELLIPSIS              -> ...
              U+20AC EURO SIGN                        -> euro
              U+22EF MIDLINE HORIZONTAL ELLIPSIS      -> ...
              U+2190 LEFTWARDS ARROW                  -> <-
              U+2192 RIGHTWARDS ARROW                 -> ->
              U+21D0 LEFTWARDS DOUBLE ARROW           -> <=
              U+21D2 RIGHTWARDS DOUBLE ARROW          -> =>
              U+FB00 LATIN SMALL LIGATURE FF          -> ff
              U+FB01 LATIN SMALL LIGATURE FI          -> fi
              U+FB02 LATIN SMALL LIGATURE FL          -> fl
              U+FB03 LATIN SMALL LIGATURE FFI         -> ffi
              U+FB04 LATIN SMALL LIGATURE FFL         -> ffl
              U+FB06 LATIN SMALL LIGATURE ST          -> st

       -y     Convert  certain  characters  having  multi-character   expansions   to   single-character   ASCII
              approximations  instead  (e.g. to maintain character-positioning). The characters affected are the
              same as those affected by the -x option.
              U+00A2 CENT SIGN                        -> c
              U+00A3 POUND SIGN                       -> #
              U+00A5 YEN SIGN                         -> Y
              U+00A9 COPYRIGHT SYMBOL                 -> C
              U+00AE REGISTERED SYMBOL                -> R
              U+00BC ONE QUARTER                      -> -
              U+00BD ONE HALF                         -> -
              U+00BE THREE QUARTERS                   -> -
              U+00C6 CAPITAL LETTER ASH               -> A
              U+00DF SMALL LETTER SHARP S             -> s
              U+00E6 SMALL LETTER ASH                 -> a
              U+0132 LIGATURE IJ                      -> I
              U+0133 LIGATURE ij                      -> i
              U+0152 LIGATURE OE                      -> O
              U+0153 LIGATURE oe                      -> o
              U+01F1 CAPITAL LETTER DZ                -> D
              U+01F2 MIXED LETTER Dz                  -> D
              U+01F3 SMALL LETTER DZ                  -> d
              U+02A6 SMALL LETTER TS DIGRAPH          -> t
              U+2026 HORIZONTAL ELLIPSIS              -> .
              U+20AC EURO SIGN                        -> E
              U+22EF MIDLINE HORIZONTAL ELLIPSIS      -> .
              U+2190 LEFTWARDS ARROW                  -> <
              U+2192 RIGHTWARDS ARROW                 -> >
              U+21D0 LEFTWARDS DOUBLE ARROW           -> <
              U+21D2 RIGHTWARDS DOUBLE ARROW          -> >

       -Z <format>
              Generate output using the supplied format. The format specified will be used as the format  string
              in a call to printf(3) with a single argument consisting of an unsigned long integer. For example,
              to obtain the same output as with format U (-a U), the format would be: \u%04X.
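
              The effect of such a format string can be previewed with Python's printf-style % operator,
              which follows the same conversion syntax for this purpose. This is only a sketch of the
              formatting step with the code point as the single argument; format_code_point is a
              hypothetical name.

```python
def format_code_point(fmt: str, ch: str) -> str:
    """Apply a printf-style format with the code point as its only argument."""
    return fmt % ord(ch)


if __name__ == "__main__":
    print(format_code_point("\\u%04X", "\u00e9"))   # same shape as format U
    print(format_code_point("<U%04X>", "\u00e9"))   # same shape as format A
```

              When passing such a format on a shell command line, the backslash and percent sign generally
              need to be quoted so that the shell does not interpret them.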

       If conversion of spaces is disabled (as it is by default), space characters outside the ASCII range
       (U+3000 ideographic space, U+1361 Ethiopic word space, and U+1680 ogham space mark) are nevertheless
       replaced with the ASCII space character (0x20) so as to keep the output pure 7-bit ASCII.

       Note that XML and XHTML numeric character references are like those of HTML with two restrictions. First,
       in X(HT)ML the terminating semicolon may not be omitted.  Second, in X(HT)ML the "x" must be lower-case,
       while in HTML it may be either upper- or lower-case. We always generate the terminating semicolon and
       use a lower-case "x", so the formats dubbed "HTML" produce valid XML and XHTML as well.
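
       The difference between the two sets of rules can be checked with Python's standard library. The sketch
       below builds a hexadecimal reference in the shape of format H and feeds it to an HTML decoder and an XML
       parser; hex_reference is a hypothetical helper, not part of uni2ascii.

```python
import html
import xml.etree.ElementTree as ET


def hex_reference(ch: str) -> str:
    """Build a hexadecimal numeric character reference with a lower-case x."""
    return "&#x%04X;" % ord(ch)


def xml_accepts(reference: str) -> bool:
    """Report whether an XML parser accepts the reference as element content."""
    try:
        ET.fromstring("<t>%s</t>" % reference)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    ref = hex_reference("\u00e9")
    print(ref, html.unescape(ref) == "\u00e9", xml_accepts(ref))
    # An upper-case X is tolerated by HTML decoding but not by XML.
    print("&#X00E9;", html.unescape("&#X00E9;") == "\u00e9", xml_accepts("&#X00E9;"))
```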

EXIT STATUS

       The following values are returned on exit:

       0 SUCCESS
              The input was successfully converted.

       2 I/O ERROR
              A system error occurred during input or output.

       3 INFO The  user  requested  information  such  as the version number or usage synopsis and this has been
              provided.

       5 BAD OPTION
              An incorrect option flag was given on the command line.

       8 BAD RECORD
              Ill-formed UTF-8 was detected in the input.
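
       A calling script can branch on these values. The following Python sketch assumes that uni2ascii is
       installed and on the PATH; the names EXIT_MEANINGS, describe_exit, and run_uni2ascii are hypothetical,
       and the descriptions simply restate the list above.

```python
import subprocess

# Exit values as documented above.
EXIT_MEANINGS = {
    0: "success",
    2: "I/O error",
    3: "info",
    5: "bad option",
    8: "bad record (ill-formed UTF-8)",
}


def describe_exit(code: int) -> str:
    """Translate an exit value into a short description."""
    return EXIT_MEANINGS.get(code, "unknown exit status %d" % code)


def run_uni2ascii(data: bytes, *options: str):
    """Pipe data through uni2ascii (assumed to be on the PATH).

    Returns the converted bytes and a description of the exit status.
    """
    proc = subprocess.run(
        ["uni2ascii", *options],
        input=data,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return proc.stdout, describe_exit(proc.returncode)
```

       Note that exit value 3 is not a failure: it indicates that the program printed the requested
       information, such as the version number, instead of converting input.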

SEE ALSO

       ascii2uni(1), Text::Unidecode

AUTHOR

       Bill Poser <billposer@alum.mit.edu>

LICENSE

       GNU General Public License

                                                   April, 2011                                      uni2ascii(1)