Provided by: uni2ascii_4.18-2_amd64

NAME

       uni2ascii - convert UTF-8 Unicode to various 7-bit ASCII representations

SYNOPSIS

       uni2ascii [options] (<input file name>)

DESCRIPTION

       uni2ascii  converts  UTF-8 Unicode to various 7-bit ASCII representations. If no format is
       specified, standard hexadecimal format (e.g. 0x00E9) is used.  It reads from the  standard
       input and writes to the standard output.

       Command line options are:

       -A     List the single character approximations carried out by the -y flag.

       -a <format>
              Convert to the specified format. Formats may be specified by means of the following
              arbitrary single character codes, by means of names such as "SGML_decimal", and  by
              examples of the desired format.

              A Generate hexadecimal numbers with prefix U in angle-brackets (<U00E9>).

              B Generate \x-escaped hex (e.g. \x00E9)

              C Generate \x escaped hexadecimal numbers in braces (e.g. \x{00E9}).

              D Generate decimal HTML numeric character references (e.g. &#0233;)

              E Generate hexadecimal with prefix U (U00E9).

              F Generate hexadecimal with prefix u (u00E9).

              G Generate hexadecimal in single quotes with prefix X (e.g. X'00E9').

              H Generate hexadecimal HTML numeric character references (e.g. &#x00E9;)

              I Generate hexadecimal UTF-8 with each byte's hex preceded by an =-sign (e.g.
              =C3=A9). This is the Quoted Printable format defined by RFC 2045.

              J Generate hexadecimal UTF-8 with each  byte's  hex  preceded  by  a  %-sign  (e.g.
              %C3%A9). This is the URI escape format defined by RFC 2396.

              K Generate octal UTF-8 with each byte escaped by a backslash (e.g.  \303\251)

              L  Generate  \U-escaped  hex  outside  the  BMP,  \u-escaped  hex  within  the  BMP
              (U+0000-U+FFFF).

              M Generate hexadecimal SGML numeric character references (e.g. \#xE9;)

              N Generate decimal SGML numeric character references (e.g. \#233;)

              O Generate octal escapes for the three low  bytes  in  big-endian  order  (e.g.
              \000\000\351)

              P Generate hexadecimal numbers with prefix U+ (e.g. U+00E9)

              Q Generate character entities (e.g. &eacute;) where possible, otherwise hexadecimal
              numeric character references.

              R Generate raw hexadecimal numbers (e.g. 00E9)

              S Generate hexadecimal escapes for the three low bytes in  big-endian  order  (e.g.
              \x00\x00\xE9)

              T  Generate  decimal  escapes  for  the  three  low bytes in big-endian order (e.g.
              \d000\d000\d233)

              U Generate \u-escaped hexadecimal numbers (e.g. \u00E9).

              V Generate \u-escaped decimal numbers (e.g. \u00233).

              X Generate standard hexadecimal numbers (e.g. 0x00E9).

              0 Generate hexadecimal UTF-8 with each byte's hex enclosed  within  angle  brackets
              (e.g. <C3><A9>).

              1 Generate Common Lisp format hexadecimal numbers (e.g. #x00E9).

              2 Generate Perl format decimal numbers with prefix v (e.g. v233).

              3 Generate hexadecimal numbers with prefix $ (e.g. $00E9).

              4 Generate PostScript format hexadecimal numbers with prefix 16# (e.g. 16#00E9).

              5 Generate Common Lisp format hexadecimal numbers with prefix #16r (e.g. #16r00E9).

              6  Generate  Ada  format  hexadecimal  numbers  with  prefix 16# and suffix # (e.g.
              16#00E9#).

              7 Generate Apache log format hexadecimal UTF-8 with each byte's hex preceded  by  a
              backslash-x (e.g.  \xC3\xA9).

              8  Generate  Microsoft OOXML format hexadecimal numbers with prefix _x and suffix _
              (e.g. _x00E9_).

              9 Generate %\u-escaped hexadecimal numbers (e.g. %\u00E9).
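
       The formats above fall into two groups: those that print the code point itself and  those
       that  print the bytes of its UTF-8 encoding. The following Python sketch reproduces a few
       of them for illustration only. It is not the program's implementation, and  the  function
       names are hypothetical.

```python
def codepoint_formats(ch: str) -> dict:
    """Render one character in several code-point-based styles from the -a list."""
    cp = ord(ch)
    return {
        "X": f"0x{cp:04X}",    # standard hexadecimal
        "U": f"\\u{cp:04X}",   # \u-escaped hexadecimal
        "P": f"U+{cp:04X}",    # prefix U+
        "H": f"&#x{cp:04X};",  # hexadecimal HTML numeric character reference
        "D": f"&#{cp:04d};",   # decimal HTML numeric character reference
    }


def utf8_byte_formats(ch: str) -> dict:
    """Render one character in styles that escape each byte of its UTF-8 encoding."""
    data = ch.encode("utf-8")
    return {
        "I": "".join(f"={b:02X}" for b in data),    # Quoted Printable style
        "J": "".join(f"%{b:02X}" for b in data),    # URI escape style
        "K": "".join(f"\\{b:03o}" for b in data),   # backslash-escaped octal
        "0": "".join(f"<{b:02X}>" for b in data),   # hex in angle brackets
    }
```

       For U+00E9 the code-point styles all print the value E9 (233  decimal),  while  the  byte
       styles print the two UTF-8 bytes C3 and A9.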

       -B     Transform to ASCII if possible. This option is equivalent to combining the  options
              -c, -d, -e, -f, and -x.

       -c     Convert circled and parenthesized characters to their unenclosed counterparts.

       -d     Strip diacritics. This converts  single  codepoints  representing  characters  with
              diacritics  to  the  corresponding  ASCII  character and deletes separately encoded
              diacritics.
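
       A common way to get a similar effect in Python is to decompose the text and drop the com-
       bining  marks.  This  is a sketch of that general technique, not of how uni2ascii does it
       internally, and strip_diacritics is a hypothetical name. Characters that have no  canoni-
       cal decomposition pass through unchanged, so the result is not guaranteed to be ASCII.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop diacritics by canonical decomposition (NFD) and removal of combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(c for c in decomposed if not unicodedata.combining(c))
```

       Both a precomposed character (U+00E9) and a base letter followed by a  separately  encoded
       diacritic (e followed by U+0301) reduce to plain "e" under this approach.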

       -e     Convert characters to their approximate ASCII equivalents, as follows:
              U+0085  next line                                   0x0A  newline
              U+00A0  no break space                              0x20  space
              U+00AB  left-pointing double angle quotation mark   0x22  double quote
              U+00AD  soft hyphen                                 0x2D  minus
              U+00AF  macron                                      0x2D  minus
              U+00B7  middle dot                                  0x2E  period
              U+00BB  right-pointing double angle quotation mark  0x22  double quote
              U+1361  ethiopic word space                         0x20  space
              U+1680  ogham space                                 0x20  space
              U+2000  en quad                                     0x20  space
              U+2001  em quad                                     0x20  space
              U+2002  en space                                    0x20  space
              U+2003  em space                                    0x20  space
              U+2004  three-per-em space                          0x20  space
              U+2005  four-per-em space                           0x20  space
              U+2006  six-per-em space                            0x20  space
              U+2007  figure space                                0x20  space
              U+2008  punctuation space                           0x20  space
              U+2009  thin space                                  0x20  space
              U+200A  hair space                                  0x20  space
              U+200B  zero-width space                            0x20  space
              U+2010  hyphen                                      0x2D  minus
              U+2011  non-breaking hyphen                         0x2D  minus
              U+2012  figure dash                                 0x2D  minus
              U+2013  en dash                                     0x2D  minus
              U+2014  em dash                                     0x2D  minus
              U+2018  left single quotation mark                  0x60  left single quote
              U+2019  right single quotation mark                 0x27  right or neutral single quote
              U+201A  single low-9 quotation mark                 0x60  left single quote
              U+201B  single high-reversed-9 quotation mark       0x60  left single quote
              U+201C  left double quotation mark                  0x22  double quote
              U+201D  right double quotation mark                 0x22  double quote
              U+201E  double low-9 quotation mark                 0x22  double quote
              U+201F  double high-reversed-9 quotation mark       0x22  double quote
              U+2022  bullet                                      0x6F  small letter o
              U+2028  line separator                              0x0A  newline
              U+2033  double prime                                0x22  double quote
              U+2039  single left-pointing angle quotation mark   0x60  left single quote
              U+203A  single right-pointing angle quotation mark  0x27  right or neutral single quote
              U+204E  low asterisk                                0x2A  asterisk
              U+2212  minus sign                                  0x2D  minus
              U+2216  set minus                                   0x5C  backslash
              U+2217  asterisk operator                           0x2A  asterisk
              U+2223  divides                                     0x7C  vertical line
              U+2500  box drawing light horizontal                0x2D  minus
              U+2501  box drawing heavy horizontal                0x2D  minus
              U+2502  box drawing light vertical                  0x7C  vertical line
              U+2503  box drawing heavy vertical                  0x7C  vertical line
              U+2731  heavy asterisk                              0x2A  asterisk
              U+275D  heavy double turned comma quotation mark    0x22  double quote
              U+275E  heavy double comma quotation mark           0x22  double quote
              U+3000  ideographic space                           0x20  space
              U+FE60  small ampersand                             0x26  ampersand
              U+FE61  small asterisk                              0x2A  asterisk
              U+FE62  small plus sign                             0x2B  plus sign
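
       The table above amounts to a one-to-one character translation. As  an  illustration,  the
       Python  sketch  below  applies a small subset of it with str.translate. The names APPROXI-
       MATIONS and approximate are hypothetical, and only a few rows of the  table  are  repro-
       duced.

```python
# A subset of the -e table: code point -> ASCII replacement.
APPROXIMATIONS = {
    0x00A0: " ",   # no break space
    0x00AB: '"',   # left-pointing double angle quotation mark
    0x00BB: '"',   # right-pointing double angle quotation mark
    0x2013: "-",   # en dash
    0x2014: "-",   # em dash
    0x2018: "`",   # left single quotation mark
    0x2019: "'",   # right single quotation mark
    0x201C: '"',   # left double quotation mark
    0x201D: '"',   # right double quotation mark
    0x2022: "o",   # bullet
    0x2212: "-",   # minus sign
}


def approximate(text: str) -> str:
    """Replace each listed character with its approximate ASCII equivalent."""
    return text.translate(APPROXIMATIONS)
```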

       -E     List the expansions performed by the -x flag.

       -f     Convert  stylistic  variants  to  plain  ASCII.   Stylistic  equivalents   include:
              superscript  and  subscript forms, small capitals (e.g. U+1D04), script forms (e.g.
              U+212C), black letter forms (e.g. U+212D), fullwidth forms (e.g. U+FF01), halfwidth
              forms (e.g. U+FF7B), and the mathematical alphanumeric symbols (e.g. U+1D400).

       -h     Help. Print the usage message and exit.

       -l     Use lowercase a-f when generating hexadecimal numbers.

       -n     Convert newlines too. By default, they are left alone.

       -P     Pass  through  Unicode  rather than converting to ASCII escapes if the character is
              not converted  to  an  ASCII  character  by  a  transformation  such  as  diacritic
              stripping. Note that if this option is used the output may not be pure ASCII.

       -p     Pure. In addition to characters above the ASCII range, convert  characters  within
              the ASCII range, except for space and newline.

       -q     Quiet. Do not chat unnecessarily while working.

       -s     Convert space characters too. By default, they are left alone.

       -S <Unicode:ASCII>
              Define a custom substitution. The argument should consist of the Unicode  codepoint
              to  be  replaced  followed  by  the  ASCII  code  of  the  character  to be used as
              replacement, separated by a  colon.  If  no  ASCII  code  follows  the  colon,  the
              specified   Unicode  character  will  be  deleted.   The  code  values  may  be  in
              hexadecimal, octal, or decimal following the usual conventions (to be precise, those
              of  strtoul(3)).   This  option  may be repeated as many times as desired to define
              multiple substitutions.
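
       To make the argument syntax concrete, the Python sketch below  parses  Unicode:ASCII  ar-
       guments  and  applies  them to a string. It is a simplified model under the assumption that
       numbers follow the base-0 rules of strtoul(3): a leading 0x means hexadecimal,  a  leading
       0 means octal, anything else is decimal. All function names are hypothetical.

```python
def parse_code(s: str) -> int:
    """Interpret a number roughly as strtoul(3) does with base 0 (simplified)."""
    s = s.strip()
    if s.lower().startswith("0x"):
        return int(s, 16)
    if len(s) > 1 and s.startswith("0"):
        return int(s, 8)
    return int(s, 10)


def parse_substitution(arg: str):
    """Split 'Unicode:ASCII' into (code point, replacement code or None)."""
    source, _, target = arg.partition(":")
    return parse_code(source), (parse_code(target) if target else None)


def apply_substitutions(text: str, args) -> str:
    """Apply every substitution; an empty ASCII part deletes the character."""
    table = {}
    for arg in args:
        codepoint, replacement = parse_substitution(arg)
        table[codepoint] = None if replacement is None else chr(replacement)
    return text.translate(table)
```

       Under these assumptions, the argument 0x2022:0x2A replaces each bullet with  an  asterisk,
       and 0x2022: deletes bullets altogether.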

       -v     Print program version information and exit.

       -w     Add a space after each converted item.

       -x     Expand certain characters to multicharacter sequences.  The characters affected are
              the same as those affected by the -y option.
              U+00A2 CENT SIGN                        -> cent
              U+00A3 POUND SIGN                       -> pound
              U+00A5 YEN SIGN                         -> yen
              U+00A9 COPYRIGHT SYMBOL                 -> (c)
              U+00AE REGISTERED SYMBOL                -> (R)
              U+00BC ONE QUARTER                      -> 1/4
              U+00BD ONE HALF                         -> 1/2
              U+00BE THREE QUARTERS                   -> 3/4
              U+00C6 CAPITAL LETTER ASH               -> AE
              U+00DF SMALL LETTER SHARP S             -> ss
              U+00E6 SMALL LETTER ASH                 -> ae
              U+0132 LIGATURE IJ                      -> IJ
              U+0133 LIGATURE ij                      -> ij
              U+0152 LIGATURE OE                      -> OE
              U+0153 LIGATURE oe                      -> oe
              U+01F1 CAPITAL LETTER DZ                -> DZ
              U+01F2 MIXED LETTER Dz                  -> Dz
              U+01F3 SMALL LETTER DZ                  -> dz
              U+02A6 SMALL LETTER TS DIGRAPH          -> ts
              U+2026 HORIZONTAL ELLIPSIS              -> ...
              U+20AC EURO SIGN                        -> euro
              U+22EF MIDLINE HORIZONTAL ELLIPSIS      -> ...
              U+2190 LEFTWARDS ARROW                  -> <-
              U+2192 RIGHTWARDS ARROW                 -> ->
              U+21D0 LEFTWARDS DOUBLE ARROW           -> <=
              U+21D2 RIGHTWARDS DOUBLE ARROW          -> =>
              U+FB00 LATIN SMALL LIGATURE FF          -> ff
              U+FB01 LATIN SMALL LIGATURE FI          -> fi
              U+FB02 LATIN SMALL LIGATURE FL          -> fl
              U+FB03 LATIN SMALL LIGATURE FFI         -> ffi
              U+FB04 LATIN SMALL LIGATURE FFL         -> ffl
              U+FB06 LATIN SMALL LIGATURE ST          -> st

       -y     Convert  certain  characters  having multi-character expansions to single-character
              ASCII  approximations  instead  (e.g.  to  maintain   character-positioning).   The
              characters affected are the same as those affected by the -x option.
              U+00A2 CENT SIGN                        -> c
              U+00A3 POUND SIGN                       -> #
              U+00A5 YEN SIGN                         -> Y
              U+00A9 COPYRIGHT SYMBOL                 -> C
              U+00AE REGISTERED SYMBOL                -> R
              U+00BC ONE QUARTER                      -> -
              U+00BD ONE HALF                         -> -
              U+00BE THREE QUARTERS                   -> -
              U+00C6 CAPITAL LETTER ASH               -> A
              U+00DF SMALL LETTER SHARP S             -> s
              U+00E6 SMALL LETTER ASH                 -> a
              U+0132 LIGATURE IJ                      -> I
              U+0133 LIGATURE ij                      -> i
              U+0152 LIGATURE OE                      -> O
              U+0153 LIGATURE oe                      -> o
              U+01F1 CAPITAL LETTER DZ                -> D
              U+01F2 MIXED LETTER Dz                  -> D
              U+01F3 SMALL LETTER DZ                  -> d
              U+02A6 SMALL LETTER TS DIGRAPH          -> t
              U+2026 HORIZONTAL ELLIPSIS              -> .
              U+20AC EURO SIGN                        -> E
              U+22EF MIDLINE HORIZONTAL ELLIPSIS      -> .
              U+2190 LEFTWARDS ARROW                  -> <
              U+2192 RIGHTWARDS ARROW                 -> >
              U+21D0 LEFTWARDS DOUBLE ARROW           -> <
              U+21D2 RIGHTWARDS DOUBLE ARROW          -> >
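
       The practical difference between -x and -y is whether the length of the text is preserved.
       The  Python sketch below contrasts the two using a few rows copied from the tables above.
       It is illustrative only, and the names EXPANSIONS, SINGLE, expand, and approximate_single
       are hypothetical.

```python
# A few rows from the -x table (multi-character expansions).
EXPANSIONS = {0x00A9: "(c)", 0x00BD: "1/2", 0x00E6: "ae",
              0x2026: "...", 0x20AC: "euro", 0x2192: "->"}

# The same characters from the -y table (single-character approximations).
SINGLE = {0x00A9: "C", 0x00BD: "-", 0x00E6: "a",
          0x2026: ".", 0x20AC: "E", 0x2192: ">"}


def expand(text: str) -> str:
    """Model of -x: output may be longer than the input."""
    return text.translate(EXPANSIONS)


def approximate_single(text: str) -> str:
    """Model of -y: one character in, one character out."""
    return text.translate(SINGLE)
```

       With the single-character table, each input character maps to exactly one  output  charac-
       ter, so column positions are unchanged.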

       -Z <format>
              Generate output using the supplied format. The format specified will be used as the
              format string in a call to printf(3)  with  a  single  argument  consisting  of  an
              unsigned  long  integer.  For example, to obtain the same output as with the U for-
              mat of the -a option, the format would be: \u%04X.
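
       The effect of a printf-style format can be previewed with  Python's  %  operator,  which
       accepts  the  same  conversion  shown above. This sketch assumes that ASCII characters are
       passed through unchanged, as in the default behaviour  described  above;  format_codepoints
       is a hypothetical name.

```python
def format_codepoints(text: str, fmt: str) -> str:
    """Apply a printf-style format to the code point of each non-ASCII character."""
    out = []
    for ch in text:
        cp = ord(ch)
        # Leave 7-bit ASCII alone; format everything else with the supplied string.
        out.append(ch if cp < 0x80 else fmt % cp)
    return "".join(out)
```

       For example, format_codepoints("café", r"\u%04X") yields caf\u00E9.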

       If conversion of spaces is disabled (as it is by default) and space characters outside the
       ASCII  range  are  encountered  (U+3000 ideographic space, U+1361 Ethiopic word space, and
       U+1680 ogham space mark), they are replaced with the ASCII space character (0x20) so as to
       keep the output pure 7-bit ASCII.

       Note  that  XML  and  XHTML  numeric  character  entities  are like those of HTML with two
       restrictions. First, in X(HT)ML the terminating semi-colon may not be omitted.  Second, in
       X(HT)ML  the  "x" must be lower-case, while in HTML it may be either upper- or lower-case.
       We always generate the terminating semi-colon and use a  lower-case  "x",  so  the  option
       dubbed "HTML" produces valid XML and XHTML as well.

EXIT STATUS

       The following values are returned on exit:

       0 SUCCESS
              The input was successfully converted.

       2 I/O ERROR
              A system error occurred during input or output.

       3 INFO The  user  requested  information  such as the version number or usage synopsis and
              this has been provided.

       5 BAD OPTION
              An incorrect option flag was given on the command line.

       8 BAD RECORD
              Ill-formed UTF-8 was detected in the input.
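
       A calling script can use these values to distinguish failure modes. The Python sketch  be-
       low  wraps  the program with subprocess.run and maps the exit status to the names listed
       above. It assumes uni2ascii is installed and on the PATH; the  names  EXIT_STATUS,  de-
       scribe_exit, and convert are hypothetical.

```python
import subprocess

# Exit status values as documented above.
EXIT_STATUS = {
    0: "SUCCESS",
    2: "I/O ERROR",
    3: "INFO",
    5: "BAD OPTION",
    8: "BAD RECORD",
}


def describe_exit(code: int) -> str:
    """Return the documented name for an exit status, or a marker if undocumented."""
    return EXIT_STATUS.get(code, f"UNKNOWN ({code})")


def convert(data: bytes, *options: str) -> bytes:
    """Feed data to uni2ascii on standard input and return its standard output."""
    proc = subprocess.run(["uni2ascii", *options], input=data, capture_output=True)
    if proc.returncode != 0:
        raise RuntimeError(f"uni2ascii failed: {describe_exit(proc.returncode)}")
    return proc.stdout
```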

SEE ALSO

       ascii2uni(1), Text::Unidecode

AUTHOR

       Bill Poser <billposer@alum.mit.edu>

LICENSE

       GNU General Public License

                                           April, 2011                               uni2ascii(1)