Ubuntu Manpage: uniconv - convert text to native formats through unicode

NAME

       uniconv - convert text to native formats through unicode

SYNOPSIS

       uniconv  -out  output-file [ -decode input-encoding ] [ -encode output-encoding ] [ input-
       file ] [ -todos ] [ -fromdos ] [ -tomac ] [ -frommac ]

DESCRIPTION

       The uniconv program decodes scripts in one encoding and encodes them in another.
       The script is a 16-, 8- or 7-bit byte stream. The converted text is sent to the
       standard output, even for 16-bit encodings, unless an output file is specified
       with the -out option.
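
       The decode-then-encode step described above can be sketched in Python. This is
       only an illustration that uses Python's built-in codecs, not uniconv's own
       converters; the function name transcode and its parameters are hypothetical.

```python
# Illustrative sketch: convert bytes between encodings by going through Unicode.
# Uses Python's built-in codecs, which are assumed to behave like the
# corresponding uniconv converters for this simple case.

def transcode(data: bytes, decode_from: str = "utf-8", encode_to: str = "utf-8") -> bytes:
    """Decode bytes to Unicode text, then encode that text in the target encoding."""
    text = data.decode(decode_from)   # bytes -> Unicode code points
    return text.encode(encode_to)     # Unicode code points -> bytes


# "Lodz" written with Polish diacritics, stored as iso-8859-2 (one byte per character).
latin2 = "\u0141\u00f3d\u017a".encode("iso-8859-2")
utf8 = transcode(latin2, decode_from="iso-8859-2", encode_to="utf-8")
print(len(latin2), len(utf8))
```

       Both parameters default to utf-8 here, mirroring the default converter named in
       this page.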

       The -decode and -encode options are optional; the default converter is utf-8. The
       program reads the Unicode map helper files (*.my) from the default directory
       /usr/share/data. Simple 1-to-1 encodings can be added on the fly by adding a
       my-file, or by setting the yudit.datapath property in ~/.yudit/yudit.properties or
       /usr/share/yudit/config/yudit.properties. By default /usr/share/yudit/data is searched.

       My-files can be created by a program called makeumap. The files can be converted
       between dos/unix/mac line-ending variants with the -fromdos, -frommac, -todos and
       -tomac options. The default (when none is specified) is Unix.
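
       What the line-ending options amount to can be sketched as follows. This assumes
       the conventional meanings (dos is CR LF, unix is LF, mac is the CR of classic
       Mac OS); the function name convert_line_endings is hypothetical and the code
       is not taken from uniconv.

```python
# Illustrative sketch of line-ending conversion between dos, unix and mac styles.
# Assumption: "mac" means the single carriage return used by classic Mac OS.

LINE_ENDINGS = {"unix": "\n", "dos": "\r\n", "mac": "\r"}


def convert_line_endings(text: str, source: str = "unix", target: str = "unix") -> str:
    """Replace every line ending of the source style with the target style."""
    lines = text.split(LINE_ENDINGS[source])
    return LINE_ENDINGS[target].join(lines)


dos_text = "first\r\nsecond\r\n"
print(repr(convert_line_endings(dos_text, source="dos", target="unix")))
```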

ENCODING

If you received this program through the Yudit distribution, then as of today you can
convert between the encodings below.

utf-8 Yudit recommends this format for international information exchange. ASCII text
will get through intact, while other unicode characters will get their 8th bit
set and the length of the code will depend on how far away they are in the
Unicode space. This is the only transformation format that can encode both 16-bit
(ucs-2) and 31-bit (ucs-4) unicode.
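
How the length of a utf-8 code grows with the code point can be checked with a short
Python sketch. It relies on Python's built-in utf-8 codec, which stops at U+10FFFF and
therefore does not cover the full 31-bit range mentioned above; the helper name
utf8_length is hypothetical.

```python
# Illustrative sketch: utf-8 code length as a function of the code point.
# Python's codec is limited to U+0000..U+10FFFF.

def utf8_length(code_point: int) -> int:
    """Return the number of bytes utf-8 needs for one code point."""
    return len(chr(code_point).encode("utf-8"))


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    encoded = chr(cp).encode("utf-8")
    # ASCII stays a single byte; every byte of a longer code has its 8th bit set.
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s), hex {encoded.hex()}")
```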

utf-8-s
Hacker's utf-8 format - it does not give an error message when a surrogate pair is
decoded and it can encode a surrogate pair 'as is'. This is not a recommended
encoding format although this format is used to encode/decode clipboard data, in
order to preserve input.
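
A comparable behaviour exists in Python as the surrogatepass error handler, which can
serve as a rough illustration of encoding a surrogate 'as is'. This is an analogy
only; whether utf-8-s produces exactly these bytes is an assumption.

```python
# Illustrative sketch: strict utf-8 rejects a lone surrogate, while the
# "surrogatepass" error handler encodes it as an ordinary 3-byte sequence.

lone_surrogate = "\ud83d"  # high half of a surrogate pair, on its own

try:
    lone_surrogate.encode("utf-8")
    strict_accepts = True
except UnicodeEncodeError:
    strict_accepts = False

lenient_bytes = lone_surrogate.encode("utf-8", errors="surrogatepass")
restored = lenient_bytes.decode("utf-8", errors="surrogatepass")

print(strict_accepts, lenient_bytes.hex(), restored == lone_surrogate)
```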

utf-16 Although 16 is bigger than 8, this is still a compromise required by OSes like
Windows that cannot handle ucs-4 - this encoding produces 16-bit unicode streams.
In addition to the BMP it can convert 16 planes using the Unicode Surrogate Area. This
encoding cannot convert anything above U+10FFFF (Plane 16). The input byte order
is recognized by the byte-order mark (BOM) U+FEFF in the first two bytes. This format
is used in Windows NT for documents like notepad .txt files.
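
The surrogate arithmetic and the byte-order check can be sketched in Python as
follows. The function names surrogate_pair and detect_utf16_order are hypothetical;
the arithmetic is the standard UTF-16 rule, not code taken from uniconv.

```python
# Illustrative sketch: UTF-16 surrogate pairs and byte-order-mark detection.
import codecs


def surrogate_pair(code_point: int) -> tuple:
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the range covered by surrogates")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def detect_utf16_order(data: bytes) -> str:
    """Report the byte order announced by a leading U+FEFF, if there is one."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    return "unknown"


high, low = surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}", detect_utf16_order(b"\xff\xfeh\x00"))
```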

utf-16-be
Big endian utf-16 converter.

utf-16-le
Little endian utf-16 converter.

utf-7 This is the recommended format for international information exchange when only
7-bit data can be used. It can only handle 16-bit (utf-16) unicode; for ucs-4 (above
U+10FFFF) you should use utf-8 encoding.
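
That utf-7 output stays within 7 bits can be checked with Python's built-in utf-7
codec. The sketch below assumes that codec is a fair stand-in for the uniconv
converter; the variable names are illustrative.

```python
# Illustrative sketch: utf-7 keeps every output byte below 0x80.

text = "caf\u00e9"                 # "cafe" with an acute accent on the e
encoded = text.encode("utf-7")

is_seven_bit = all(byte < 0x80 for byte in encoded)
round_trip = encoded.decode("utf-7")

print(encoded, is_seven_bit, round_trip == text)
```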

iso-8859-1
This is the ISO 8859-1 character encoding format. It is also known as "Latin-1"
encoding.

iso-8859-2
This is the ISO 8859-2 character encoding format. It is also known as "Central
European" encoding.

iso-8859-5
This is the ISO 8859-5 character encoding format. It is also known as "Cyrillic"
encoding.

iso-8859-7
This is the ISO 8859-7 character encoding format. It is also known as "Greek"
encoding.

iso-8859-9
This is the ISO 8859-9 character encoding format. It is also known as "Turkish"
encoding.

koi8-r This is the KOI8-R character encoding format. It is mainly used in Russia.

cp-1251
This is the CP1251 Cyrillic character encoding format. It is mainly used in
Microsoft Windows and on some web sites.

iso-2022-jp
This is a Japanese character encoding format. It is a 7-bit encoding format.
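
A 7-bit encoding of Japanese text works by switching character sets with escape
sequences. The sketch below uses Python's iso2022_jp codec as an illustration and
assumes it matches the uniconv converter for this simple input.

```python
# Illustrative sketch: iso-2022-jp switches character sets with ESC sequences,
# so Japanese text is carried entirely in 7-bit bytes.

text = "\u65e5\u672c"              # the two kanji of "Nihon" (Japan)
encoded = text.encode("iso2022_jp")

is_seven_bit = all(byte < 0x80 for byte in encoded)
switches_to_kanji = encoded.startswith(b"\x1b$B")   # ESC $ B selects JIS X 0208
returns_to_ascii = encoded.endswith(b"\x1b(B")      # ESC ( B selects ASCII again

print(encoded, is_seven_bit)
```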

iso-2022-jp-3
This is a Japanese character encoding format. It is a 7-bit encoding format. It is
based upon the JIS X 0213 standard.

euc-jp This is a Japanese character encoding format. It is an 8-bit encoding format.
Mainly used in UNIX systems.

euc-jp-3
The official name is EUC-JISX0213 - I just could not read this. This is a Japanese
character encoding format. It is an 8-bit encoding format. It is based upon the JIS X
0213 standard.

shift-jis
This is a Japanese character encoding format. It is an 8-bit encoding format.
Mainly used in MSDOS/Windows.

shift-jis-3
The official name is Shift_JISX0213 - I just could not read this. This is a
Japanese character encoding format. It is an 8-bit encoding format. Mainly used in
MSDOS/Windows.

iso-2022-jp
This is a Japanese 7-bit character encoding format. Email messages in iso-2022-jp
can be decoded and encoded with this converter.

iso-2022-x11
This is a Japanese character encoding format. It is also known as "COMPOUND_TEXT"
encoding for the X Window System. This is a 7-bit encoding format. It can be
derived from the ISO 2022-JP format with some differences.

ksc-5601-x11
This is a Korean character encoding format used by the X Window System
(COMPOUND_TEXT encoding) to encode Korean (KS X 1001) and US-ASCII. This is a
7-bit encoding format compliant with the ISO-2022 specification for encoding multiple
character sets. Please note that this is DIFFERENT from ISO-2022-KR (defined in
IETF RFC 1557).

euc-kr This is an 8-bit multibyte encoding for Korean. It encodes US-ASCII (7-bit) in the
single-byte range and characters in KS X 1001 (formerly KS C 5601) in the double-byte
range with the MSB set (8-bit). It is used on Unix and the Internet. Korean versions of
MS-DOS, MacOS and MS-Windows use compatible (in most cases identical) variants of this
encoding.
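
The single-byte and double-byte ranges can be observed with Python's euc_kr codec.
This is a sketch that assumes the codec matches the uniconv converter for the sample
characters; the helper name euc_kr_widths is hypothetical.

```python
# Illustrative sketch: euc-kr uses one byte for US-ASCII and two bytes,
# both with the most significant bit set, for KS X 1001 characters.

def euc_kr_widths(text: str) -> list:
    """Return the encoded length in bytes of each character."""
    return [len(ch.encode("euc_kr")) for ch in text]


hangul = "\ud55c"                      # Hangul syllable HAN
hangul_bytes = hangul.encode("euc_kr")

print(euc_kr_widths("A" + hangul), [byte >= 0x80 for byte in hangul_bytes])
```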

johab This is a Korean encoding specified in KS X 1001 (KS C 5601-1992), Annex
3 as a supplementary encoding. Widely used in Korean MS-DOS until the mid-1990s.
It can encode all Hangul syllables (11,172) of modern Korean as well as all the
special symbols and Hanja (Chinese ideograms used in Korea) defined in KS X 1001.

uhc A variant of EUC-KR used in Korean MS-Windows 95/98 (proprietary encoding of
Microsoft, CP949). Its character repertoire includes all modern syllables of
Hangul, the Korean script, as well as all the special symbols and Hanja (Chinese
ideograms used in Korea) defined in KS X 1001.

gb-18030
This is a Chinese character encoding format based upon GB 18030. It encodes the
whole U+0000..U+10FFFF range, while being compatible with gb-2312.
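
Both properties, compatibility with gb-2312 and coverage of the whole range, can be
checked with Python's gb18030 and gb2312 codecs. The sketch assumes those codecs
agree with the uniconv converters for the sample characters.

```python
# Illustrative sketch: gb-18030 keeps the gb-2312 byte sequences and
# extends the coverage up to U+10FFFF.

han = "\u4e2d"                         # the character "zhong" (middle)
same_as_gb2312 = han.encode("gb18030") == han.encode("gb2312")

top = chr(0x10FFFF)                    # last code point of Plane 16
top_bytes = top.encode("gb18030")
round_trip = top_bytes.decode("gb18030") == top

print(same_as_gb2312, len(top_bytes), round_trip)
```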

gb-2312-x11
This is a Chinese character encoding format based upon GB 2312. It is a 7-bit
encoding format.

gb-2312
This is a Chinese character encoding format based upon GB 2312. It is an 8-bit
encoding format.

big-5 This is a Chinese character encoding format based upon BIG5 encoding. It is an
8-bit encoding format.

hz This is a Chinese character encoding format based upon "Hanzi" encoding. It is a
7-bit encoding format.

viscii This is a Vietnamese character encoding format.

ucs-2-be
This converts 16-bit unicode (ucs-2) streams. This format handles the big-endian
variant. Yudit does not recommend this format.

ucs-2-le
This converts 16-bit unicode (ucs-2) streams. This format handles the
little-endian variant. Yudit does not recommend this format.

ucs-2 This converts 16-bit unicode (ucs-2) streams. The input byte order is recognized
by the byte-order mark (BOM) U+FEFF in the first two bytes. Yudit does not recommend
this format.

java This converts \uxxxx character escapes. When encoding, all characters above U+0080
will be escaped with a string like '\u0080'. When decoding, the same escapes are
recognized and, in addition, utf-8 input is also accepted, so it can also be used
to recover data accidentally saved with the wrong encoding. The U+10000..U+10FFFF
area is converted to surrogates and vice versa.
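
The encoding direction can be sketched in Python as below. The function name
java_escape is hypothetical, and the sketch assumes that U+0080 itself is escaped
(as the '\u0080' example suggests) and that lowercase hex digits are used; neither
detail is confirmed by this page.

```python
# Illustrative sketch of \uxxxx escaping with surrogates for U+10000..U+10FFFF.
# Assumptions: code points from U+0080 upward are escaped, hex is lowercase.

def java_escape(text: str) -> str:
    """Escape non-ASCII characters as \\uxxxx, using surrogate pairs above U+FFFF."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            parts.append(ch)                       # ASCII passes through unchanged
        elif cp < 0x10000:
            parts.append(f"\\u{cp:04x}")           # one escape for a BMP character
        else:
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            parts.append(f"\\u{high:04x}\\u{low:04x}")  # two escapes, one per surrogate
    return "".join(parts)


print(java_escape("a\u00e9"), java_escape(chr(0x1F600)))
```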

java-s This converts \uxxxx character escapes. When encoding, all characters above U+0080
will be escaped with a string like '\u0080'. When decoding, the same escapes are
recognized and, in addition, utf-8 input is also accepted, so it can also be used
to recover data accidentally saved with the wrong encoding. Surrogates are not
treated specially during conversion - this is why it is not a recommended
conversion.

FILES

       ~/.yudit/yudit.properties or /usr/share/yudit/config/yudit.properties
              can have a yudit.datapath property. This is where the map files are kept.  By
              default /usr/share/yudit/data is searched.

AUTHOR