Ubuntu Manpage: uni2ascii - convert UTF-8 Unicode to various 7-bit ASCII representations

NAME

       uni2ascii - convert UTF-8 Unicode to various 7-bit ASCII representations

SYNOPSIS

       uni2ascii [options] (<input file name>)

DESCRIPTION

uni2ascii converts UTF-8 Unicode to various 7-bit ASCII representations. If no format is
specified, standard hexadecimal format (e.g. 0x00E9) is used. If no input file is named, it
reads from the standard input. Output is written to the standard output.

Command line options are:

-A List the single character approximations carried out by the -y flag.

-a <format>
Convert to the specified format. Formats may be specified by means of the following
arbitrary single character codes, by means of names such as "SGML_decimal", and by
examples of the desired format.

A Generate hexadecimal numbers with prefix U in angle-brackets (<U00E9>).

B Generate \x-escaped hex (e.g. \x00E9)

C Generate \x escaped hexadecimal numbers in braces (e.g. \x{00E9}).

D Generate decimal HTML numeric character references (e.g. &#0233;)

E Generate hexadecimal with prefix U (U00E9).

F Generate hexadecimal with prefix u (u00E9).

G Generate hexadecimal in single quotes with prefix X (e.g. X'00E9').

H Generate hexadecimal HTML numeric character references (e.g. &#x00E9;)

I Generate hexadecimal UTF-8 with each byte's hex preceded by an =-sign (e.g.
=C3=A9). This is the Quoted Printable format defined by RFC 2045.

J Generate hexadecimal UTF-8 with each byte's hex preceded by a %-sign (e.g.
%C3%A9). This is the URI escape format defined by RFC 2396.

K Generate octal UTF-8 with each byte escaped by a backslash (e.g. \303\251)

L Generate \U-escaped hex outside the BMP, \u-escaped hex within the BMP
(U+0000-U+FFFF).

M Generate hexadecimal SGML numeric character references (e.g. \#xE9;)

N Generate decimal SGML numeric character references (e.g. \#233;)

O Generate octal escapes for the three low bytes in big-endian order (e.g.
\000\000\351).

P Generate hexadecimal numbers with prefix U+ (e.g. U+00E9)

Q Generate character entities (e.g. &eacute;) where possible, otherwise hexadecimal
numeric character references.

R Generate raw hexadecimal numbers (e.g. 00E9)

S Generate hexadecimal escapes for the three low bytes in big-endian order (e.g.
\x00\x00\xE9)

T Generate decimal escapes for the three low bytes in big-endian order (e.g.
\d000\d000\d233)

U Generate \u-escaped hexadecimal numbers (e.g. \u00E9).

V Generate \u-escaped decimal numbers (e.g. \u00233).

X Generate standard hexadecimal numbers (e.g. 0x00E9).

0 Generate hexadecimal UTF-8 with each byte's hex enclosed within angle brackets
(e.g. <C3><A9>).

1 Generate Common Lisp format hexadecimal numbers (e.g. #x00E9).

2 Generate Perl format decimal numbers with prefix v (e.g. v233).

3 Generate hexadecimal numbers with prefix $ (e.g. $00E9).

4 Generate Postscript format hexadecimal numbers with prefix 16# (e.g. 16#00E9).

5 Generate Common Lisp format hexadecimal numbers with prefix #16r (e.g. #16r00E9).

6 Generate ADA format hexadecimal numbers with prefix 16# and suffix # (e.g.
16#00E9#).

7 Generate Apache log format hexadecimal UTF-8 with each byte's hex preceded by a
backslash-x (e.g. \xC3\xA9).

8 Generate Microsoft OOXML format hexadecimal numbers with prefix _x and suffix _
(e.g. _x00E9_).

9 Generate %\u-escaped hexadecimal numbers (e.g. %\u00E9).
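
To show how several of the formats above relate to one another, the following Python sketch renders a single code point in a handful of them. It is not part of uni2ascii. The function name render and the choice of formats are assumptions made for this illustration; only the format letters and the sample outputs are taken from the list above.

```python
def render(ch: str) -> dict:
    """Render one character in several of the -a formats listed above."""
    cp = ord(ch)                  # the Unicode code point as an integer
    utf8 = ch.encode("utf-8")     # the UTF-8 byte sequence
    return {
        "A": f"<U{cp:04X}>",                          # <U00E9>
        "D": f"&#{cp:04d};",                          # decimal HTML reference
        "H": f"&#x{cp:04X};",                         # hexadecimal HTML reference
        "I": "".join(f"={b:02X}" for b in utf8),      # Quoted Printable bytes
        "J": "".join(f"%{b:02X}" for b in utf8),      # URI-escaped bytes
        "K": "".join(f"\\{b:03o}" for b in utf8),     # octal-escaped bytes
        "P": f"U+{cp:04X}",
        "U": f"\\u{cp:04X}",
        "X": f"0x{cp:04X}",
        "0": "".join(f"<{b:02X}>" for b in utf8),     # bytes in angle brackets
    }


# U+00E9 is the character used in the examples above.
for code, text in render("\u00e9").items():
    print(code, text)
```

Formats A, D, H, P, U and X are built from the code point itself, while I, J, K and 0 are built from the bytes of its UTF-8 encoding, which is why the latter show C3 A9 rather than 00E9.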

-B Transform to ASCII if possible. This option is equivalent to the combination cdefx.

-c Convert circled and parenthesized characters to their unenclosed counterparts.

-d Strip diacritics. This converts single codepoints representing characters with
diacritics to the corresponding ASCII character and deletes separately encoded
diacritics.
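
The effect of diacritic stripping can be approximated in Python with Unicode normalization. This is only a sketch of the idea and not how uni2ascii is implemented: uni2ascii uses its own tables, so results may differ for characters that have no canonical decomposition. The function name strip_diacritics is hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop nonspacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


# Handles both a precomposed e-acute (U+00E9) and a separately
# encoded combining diaeresis (U+0308).
print(strip_diacritics("r\u00e9sum\u00e9 nai\u0308ve"))
```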

-e Convert characters to their approximate ASCII equivalents, as follows:
U+0085  next line                                   0x0A  newline
U+00A0  no break space                              0x20  space
U+00AB  left-pointing double angle quotation mark   0x22  double quote
U+00AD  soft hyphen                                 0x2D  minus
U+00AF  macron                                      0x2D  minus
U+00B7  middle dot                                  0x2E  period
U+00BB  right-pointing double angle quotation mark  0x22  double quote
U+1361  ethiopic word space                         0x20  space
U+1680  ogham space                                 0x20  space
U+2000  en quad                                     0x20  space
U+2001  em quad                                     0x20  space
U+2002  en space                                    0x20  space
U+2003  em space                                    0x20  space
U+2004  three-per-em space                          0x20  space
U+2005  four-per-em space                           0x20  space
U+2006  six-per-em space                            0x20  space
U+2007  figure space                                0x20  space
U+2008  punctuation space                           0x20  space
U+2009  thin space                                  0x20  space
U+200A  hair space                                  0x20  space
U+200B  zero-width space                            0x20  space
U+2010  hyphen                                      0x2D  minus
U+2011  non-breaking hyphen                         0x2D  minus
U+2012  figure dash                                 0x2D  minus
U+2013  en dash                                     0x2D  minus
U+2014  em dash                                     0x2D  minus
U+2018  left single quotation mark                  0x60  left single quote
U+2019  right single quotation mark                 0x27  right or neutral single quote
U+201A  single low-9 quotation mark                 0x60  left single quote
U+201B  single high-reversed-9 quotation mark       0x60  left single quote
U+201C  left double quotation mark                  0x22  double quote
U+201D  right double quotation mark                 0x22  double quote
U+201E  double low-9 quotation mark                 0x22  double quote
U+201F  double high-reversed-9 quotation mark       0x22  double quote
U+2022  bullet                                      0x6F  small letter o
U+2028  line separator                              0x0A  newline
U+2033  double prime                                0x22  double quote
U+2039  single left-pointing angle quotation mark   0x60  left single quote
U+203A  single right-pointing angle quotation mark  0x27  right or neutral single quote
U+204E  low asterisk                                0x2A  asterisk
U+2212  minus sign                                  0x2D  minus
U+2216  set minus                                   0x5C  backslash
U+2217  asterisk operator                           0x2A  asterisk
U+2223  divides                                     0x7C  vertical line
U+2500  box drawing light horizontal                0x2D  minus
U+2501  box drawing heavy horizontal                0x2D  minus
U+2502  box drawing light vertical                  0x7C  vertical line
U+2503  box drawing heavy vertical                  0x7C  vertical line
U+2731  heavy asterisk                              0x2A  asterisk
U+275D  heavy double turned comma quotation mark    0x22  double quote
U+275E  heavy double comma quotation mark           0x22  double quote
U+3000  ideographic space                           0x20  space
U+FE60  small ampersand                             0x26  ampersand
U+FE61  small asterisk                              0x2A  asterisk
U+FE62  small plus sign                             0x2B  plus sign
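
A table of this kind is a one-to-one character mapping. The Python sketch below applies a small subset of the rows above with str.translate. It only illustrates the mapping and is not the code uni2ascii uses; the names APPROX and approximate are hypothetical.

```python
# A subset of the -e table: code point -> ASCII replacement.
APPROX = {
    0x00A0: " ",    # no break space
    0x00AB: '"',    # left-pointing double angle quotation mark
    0x00BB: '"',    # right-pointing double angle quotation mark
    0x2013: "-",    # en dash
    0x2014: "-",    # em dash
    0x2018: "`",    # left single quotation mark
    0x2019: "'",    # right single quotation mark
    0x201C: '"',    # left double quotation mark
    0x201D: '"',    # right double quotation mark
    0x2022: "o",    # bullet
    0x2212: "-",    # minus sign
}


def approximate(text: str) -> str:
    """Replace each mapped character; leave everything else unchanged."""
    return text.translate(APPROX)


print(approximate("\u201cen\u2013dash\u201d \u2022 it\u2019s"))
```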

-E List the expansions performed by the -x flag.

-f Convert stylistic variants to plain ASCII. Stylistic equivalents include:
superscript and subscript forms, small capitals (e.g. U+1D04), script forms (e.g.
U+212C), black letter forms (e.g. U+212D), fullwidth forms (e.g. U+FF01), halfwidth
forms (e.g. U+FF7B), and the mathematical alphanumeric symbols (e.g. U+1D400).
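
Many of these stylistic variants carry a Unicode compatibility decomposition, so NFKC normalization gives a rough idea of what the conversion does. The sketch below is an approximation under that assumption, not uni2ascii's own algorithm. NFKC also rewrites characters that -f does not list, and it does not cover every class named above: small capitals such as U+1D04, for instance, have no compatibility decomposition. The function name plain_variants is hypothetical.

```python
import unicodedata


def plain_variants(text: str) -> str:
    """Fold compatibility variants (fullwidth, script, math bold, ...) via NFKC."""
    return unicodedata.normalize("NFKC", text)


samples = {
    "fullwidth exclamation mark": "\uff01",
    "script capital B": "\u212c",
    "black-letter capital C": "\u212d",
    "mathematical bold capital A": "\U0001d400",
    "superscript two": "\u00b2",
}
for name, ch in samples.items():
    print(f"U+{ord(ch):04X} {name} -> {plain_variants(ch)}")
```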

-h Help. Print the usage message and exit.

-l Use lowercase a-f when generating hexadecimal numbers.

-n Convert newlines too. By default, they are left alone.

-P Pass through Unicode rather than converting to ASCII escapes if the character is
not converted to an ASCII character by a transformation such as diacritic
stripping. Note that if this option is used the output may not be pure ASCII.

-p Pure. Also convert characters within the ASCII range, except for space and newline, in
addition to those above it.

-q Quiet. Do not chat unnecessarily while working.

-s Convert space characters too. By default, they are left alone.

-S <Unicode:ASCII>
Define a custom substitution. The argument should consist of the Unicode codepoint
to be replaced followed by the ASCII code of the character to be used as
replacement, separated by a colon. If no ASCII code follows the colon, the
specified Unicode character will be deleted. The code values may be in
hexadecimal, octal, or decimal following the usual conventions (to be precise, those
of strtoul(3)). This option may be repeated as many times as desired to define
multiple substitutions.
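
The argument syntax can be illustrated with a short Python sketch. It is a simplified reading of the description above, not uni2ascii's parser: parse_number imitates the prefix rules of strtoul(3) with base 0 (0x for hexadecimal, a leading 0 for octal, otherwise decimal) but ignores signs and trailing garbage. Both function names are hypothetical.

```python
def parse_number(s: str) -> int:
    """Parse an integer using strtoul(3)-style prefixes (base 0)."""
    s = s.strip()
    if s.lower().startswith("0x"):
        return int(s[2:], 16)          # hexadecimal, e.g. 0x2022
    if len(s) > 1 and s.startswith("0"):
        return int(s, 8)               # octal, e.g. 052
    return int(s, 10)                  # decimal, e.g. 8226


def parse_substitution(arg: str):
    """Split 'Unicode:ASCII' into (code point, replacement character or None)."""
    source, _, target = arg.partition(":")
    codepoint = parse_number(source)
    # An empty right-hand side means the character is to be deleted.
    replacement = chr(parse_number(target)) if target else None
    return codepoint, replacement


print(parse_substitution("0x2022:0x2A"))   # bullet replaced by an asterisk
print(parse_substitution("0x200B:"))       # zero-width space deleted
```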

-v Print program version information and exit.

-w Add a space after each converted item.

-x Expand certain characters to multicharacter sequences. The characters affected are
the same as those affected by the -y option.
U+00A2 CENT SIGN -> cent
U+00A3 POUND SIGN -> pound
U+00A5 YEN SIGN -> yen
U+00A9 COPYRIGHT SYMBOL -> (c)
U+00AE REGISTERED SYMBOL -> (R)
U+00BC ONE QUARTER -> 1/4
U+00BD ONE HALF -> 1/2
U+00BE THREE QUARTERS -> 3/4
U+00C6 CAPITAL LETTER ASH -> AE
U+00DF SMALL LETTER SHARP S -> ss
U+00E6 SMALL LETTER ASH -> ae
U+0132 LIGATURE IJ -> IJ
U+0133 LIGATURE ij -> ij
U+0152 LIGATURE OE -> OE
U+0153 LIGATURE oe -> oe
U+01F1 CAPITAL LETTER DZ -> DZ
U+01F2 MIXED LETTER Dz -> Dz
U+01F3 SMALL LETTER DZ -> dz
U+02A6 SMALL LETTER TS DIGRAPH -> ts
U+2026 HORIZONTAL ELLIPSIS -> ...
U+20AC EURO SIGN -> euro
U+22EF MIDLINE HORIZONTAL ELLIPSIS -> ...
U+2190 LEFTWARDS ARROW -> <-
U+2192 RIGHTWARDS ARROW -> ->
U+21D0 LEFTWARDS DOUBLE ARROW -> <=
U+21D2 RIGHTWARDS DOUBLE ARROW -> =>
U+FB00 LATIN SMALL LIGATURE FF -> ff
U+FB01 LATIN SMALL LIGATURE FI -> fi
U+FB02 LATIN SMALL LIGATURE FL -> fl
U+FB03 LATIN SMALL LIGATURE FFI -> ffi
U+FB04 LATIN SMALL LIGATURE FFL -> ffl
U+FB06 LATIN SMALL LIGATURE ST -> st

-y Convert certain characters having multi-character expansions to single-character
ASCII approximations instead (e.g. to maintain character positioning). The
characters affected are the same as those affected by the -x option.
U+00A2 CENT SIGN -> c
U+00A3 POUND SIGN -> #
U+00A5 YEN SIGN -> Y
U+00A9 COPYRIGHT SYMBOL -> C
U+00AE REGISTERED SYMBOL -> R
U+00BC ONE QUARTER -> -
U+00BD ONE HALF -> -
U+00BE THREE QUARTERS -> -
U+00C6 CAPITAL LETTER ASH -> A
U+00DF SMALL LETTER SHARP S -> s
U+00E6 SMALL LETTER ASH -> a
U+0132 LIGATURE IJ -> I
U+0133 LIGATURE ij -> i
U+0152 LIGATURE OE -> O
U+0153 LIGATURE oe -> o
U+01F1 CAPITAL LETTER DZ -> D
U+01F2 MIXED LETTER Dz -> D
U+01F3 SMALL LETTER DZ -> d
U+02A6 SMALL LETTER TS DIGRAPH -> t
U+2026 HORIZONTAL ELLIPSIS -> .
U+20AC EURO SIGN -> E
U+22EF MIDLINE HORIZONTAL ELLIPSIS -> .
U+2190 LEFTWARDS ARROW -> <
U+2192 RIGHTWARDS ARROW -> >
U+21D0 LEFTWARDS DOUBLE ARROW -> <
U+21D2 RIGHTWARDS DOUBLE ARROW -> >
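
The practical difference between -x and -y is whether the text keeps its length. The Python sketch below applies a few rows from each of the two tables above to the same string. It is an illustration only; the names EXPANSIONS and SINGLE are hypothetical.

```python
# A subset of the -x table (multi-character expansions).
EXPANSIONS = {
    0x00A9: "(c)", 0x00BD: "1/2", 0x00C6: "AE", 0x00DF: "ss",
    0x2026: "...", 0x20AC: "euro", 0x2192: "->",
}
# The same characters in the -y table (single-character approximations).
SINGLE = {
    0x00A9: "C", 0x00BD: "-", 0x00C6: "A", 0x00DF: "s",
    0x2026: ".", 0x20AC: "E", 0x2192: ">",
}

# Sharp s, rightwards arrow, euro sign, horizontal ellipsis.
sample = "Stra\u00dfe \u2192 5 \u20ac\u2026"
expanded = sample.translate(EXPANSIONS)
single = sample.translate(SINGLE)

print(expanded)
print(single)
print(len(sample), len(expanded), len(single))
```

With the single-character table every character stays in its original column, which is the character positioning that the description of -y refers to.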

-Z <format>
Generate output using the supplied format. The format specified will be used as the
format string in a call to printf(3) with a single argument consisting of an
unsigned long integer. For example, to obtain the same output as with the U format
of the -a option, the format would be: \u%04X.
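
Python's % operator follows the same conversion syntax as printf(3) for simple integer conversions, so it can be used to preview what a format string would produce. This is a sketch for experimentation, not a call into uni2ascii, and the function name format_codepoint is hypothetical.

```python
def format_codepoint(fmt: str, ch: str) -> str:
    """Apply a printf-style format to the code point of a single character."""
    return fmt % ord(ch)


print(format_codepoint("\\u%04X", "\u00e9"))   # same shape as the U format
print(format_codepoint("U+%04X", "\u00e9"))    # same shape as the P format
print(format_codepoint("&#%d;", "\u00e9"))     # an unpadded decimal reference
```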

If conversion of spaces is disabled (as it is by default) and space characters outside the
ASCII range are encountered (U+3000 ideographic space, U+1361 Ethiopic word space, and
U+1680 ogham space mark), they are replaced with the ASCII space character (0x20) so as to
keep the output pure 7-bit ASCII.

Note that XML and XHTML numeric character references are like those of HTML with two
restrictions. First, in X(HT)ML the terminating semicolon may not be omitted. Second, in
X(HT)ML the "x" must be lower-case, while in HTML it may be either upper- or lower-case.
We always generate the terminating semicolon and use a lower-case "x", so the option
dubbed "HTML" produces valid XML and XHTML as well.
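
The case restriction on "x" can be checked with Python's standard library, which is used here purely as a stand-in for an HTML decoder and an XML parser. The helper name xml_accepts and the wrapper element t are hypothetical.

```python
import html
import xml.etree.ElementTree as ET


def xml_accepts(reference: str) -> bool:
    """Return True if an XML parser accepts the reference inside an element."""
    try:
        ET.fromstring(f"<t>{reference}</t>")
        return True
    except ET.ParseError:
        return False


lower = "&#x00E9;"   # the form generated by uni2ascii
upper = "&#X00E9;"   # allowed in HTML only

# HTML decoding treats both spellings alike.
print(html.unescape(lower) == html.unescape(upper))
# XML parsing accepts only the lower-case "x".
print(xml_accepts(lower), xml_accepts(upper))
```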

EXIT STATUS

       The following values are returned on exit:

       0 SUCCESS
              The input was successfully converted.

       2 I/O ERROR
              A system error occurred during input or output.

       3 INFO
              The user requested information such as the version number or usage synopsis and
              this has been provided.

       5 BAD OPTION
              An incorrect option flag was given on the command line.

       8 BAD RECORD
              Ill-formed UTF-8 was detected in the input.
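
       A calling script can use these values to tell bad input apart from a bad invocation. The
       Python sketch below is one possible wrapper, assuming that uni2ascii is installed and on
       the PATH. The names EXIT_STATUS, describe and convert are hypothetical, and only the
       numeric values and labels come from the list above.

```python
import subprocess

# Exit values as listed above.
EXIT_STATUS = {
    0: "SUCCESS",
    2: "I/O ERROR",
    3: "INFO",
    5: "BAD OPTION",
    8: "BAD RECORD",
}


def describe(code: int) -> str:
    """Map an exit value to its label, or flag it as unknown."""
    return EXIT_STATUS.get(code, f"unknown status {code}")


def convert(data: bytes, *options: str) -> bytes:
    """Pipe data through uni2ascii; raise if the exit value is not 0."""
    proc = subprocess.run(
        ["uni2ascii", *options],   # assumes uni2ascii is on the PATH
        input=data,
        stdout=subprocess.PIPE,
    )
    if proc.returncode != 0:
        raise RuntimeError(describe(proc.returncode))
    return proc.stdout


print(describe(8))
```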

AUTHOR

       Bill Poser <billposer@alum.mit.edu>

LICENSE