Provided by: yudit_2.9.6-8build1_amd64

NAME

       uniconv - convert text to native formats through unicode

SYNOPSIS

       uniconv -out output-file [ -decode input-encoding ] [ -encode output-encoding ] [ input-file ] [ -todos ]
       [ -fromdos ] [ -tomac ] [ -frommac ]

DESCRIPTION

       The uniconv program decodes scripts in one encoding and encodes them in another.  The script is a
       16-, 8- or 7-bit byte stream.  The converted text is sent to the standard output, even for 16-bit
       encodings, unless an output file is specified with the -out option.

       The -decode and -encode options are optional; the default converter is utf-8.  The program reads the
       Unicode map helper files (*.my) from the data directory, which is /usr/share/yudit/data by default.
       Simple 1-to-1 encodings can be added on the fly by adding a my-file, or by setting the yudit.datapath
       property in ~/.yudit/yudit.properties or /usr/share/yudit/config/yudit.properties.

       My-files can be created by a program called makeumap.  The files can be converted between
       DOS/Unix/Mac line-ending variants with the -fromdos, -frommac, -todos and -tomac options.  The
       default (when none is specified) is Unix.
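
       As an illustration of what the line-ending options do conceptually, here is a minimal Python sketch.
       It is not uniconv's implementation; the names EOL and convert_eol are hypothetical, and it assumes
       that "dos" means CR LF, "unix" means LF and "mac" means the classic Mac OS CR.

```python
# Hypothetical helper, not part of Yudit.
# Assumed byte sequence for each line-ending style.
EOL = {"unix": b"\n", "dos": b"\r\n", "mac": b"\r"}


def convert_eol(data: bytes, source: str, target: str) -> bytes:
    """Replace every `source` line ending in `data` with the `target` one."""
    return EOL[target].join(data.split(EOL[source]))
```

       Under these assumptions, convert_eol(data, "dos", "unix") corresponds roughly to -fromdos, and
       convert_eol(data, "unix", "mac") to -tomac.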

ENCODING

       If you received this program through the Yudit distribution, then as of today you can convert between the
       encodings below.

       utf-8  Yudit  recommends  this  format  for  international  information  exchange.  ASCII text  will  get
              through  intact, while other unicode characters will get their 8th bit set and the length  of  the
              code   will depend on how far away they are in the Unicode space.  This is the only transformation
              format that can encode both 16-bit (ucs-2) and 31-bit (ucs-4) unicode.
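
              The length rule described above can be checked with Python's built-in utf-8 codec.  This is
              only an illustration and not the Yudit converter; the helper name utf8_profile is
              hypothetical.

```python
def utf8_profile(ch):
    """Return (number of utf-8 bytes, whether every byte has its 8th bit set)."""
    data = ch.encode("utf-8")
    return len(data), all(b & 0x80 for b in data)


# ASCII stays a single 7-bit byte; code points further out need more bytes.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    size, high = utf8_profile(ch)
    print(f"U+{ord(ch):04X}: {size} byte(s), 8th bit set on every byte: {high}")
```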

       utf-8-s
              Hacker's utf-8 format - it does not give an error message when a surrogate pair is decoded, and
              it can encode a surrogate pair 'as is'.  This is not a recommended encoding format, although it
              is used to encode/decode clipboard data in order to preserve input.

       utf-16 Although 16 is bigger than 8, this is still a compromise required by OSes like Windows that cannot
              handle ucs-4 - this encoding produces 16-bit unicode streams.  In addition to the BMP it can
              convert 16 planes using the Unicode Surrogate Area.  This encoding cannot convert anything above
              U+10FFFF (Plane 16).  The input byte order is recognized from the first two bytes, the BOM
              (byte-order mark) U+FEFF.  This format is used in Windows NT for documents like notepad .txt
              files.
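
              The two mechanisms mentioned above, byte-order detection from the BOM and the use of the
              surrogate area for planes 1 to 16, can be sketched in Python as follows.  This is a minimal
              sketch and not Yudit's code; the function names utf16_byte_order and to_surrogate_pair are
              hypothetical.

```python
import codecs


def utf16_byte_order(data):
    """Guess the byte order of a utf-16 stream from its first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little-endian"
    return "unknown"


def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only planes 1 to 16 are encoded with surrogates")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)
```

              A stream without a BOM is reported as "unknown" here; for such input the explicit utf-16-be
              or utf-16-le converters are the natural choice.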

       utf-16-be
              Big endian utf-16 converter.

       utf-16-le
              Little endian utf-16 converter.

       utf-7  This is the recommended format for international information exchange when only 7-bit data can be
              used.  It can only handle 16-bit (utf-16) unicode; for ucs-4 (above U+10FFFF) you should use the
              utf-8 encoding.

       iso-8859-1
              This is the ISO 8859-1 character  encoding format. It is also known as "Latin-1" encoding.

       iso-8859-2
              This  is  the ISO 8859-2 character encoding  format.  It  is  also  known  as  "Central  European"
              encoding.

       iso-8859-5
              This is the ISO 8859-5 character encoding format. It is also known as "Cyrillic" encoding.

       iso-8859-7
              This is the ISO 8859-7 character encoding format. It is also known as "Greek" encoding.

       iso-8859-9
              This is the ISO 8859-9 character encoding format. It is also known as "Turkish" encoding.

       koi8-r This is the KOI8-R character encoding format. It is mainly used in Russia.

       cp-1251
              This  is the CP1251 cyrillic character encoding format. It is mainly used in Microsoft Windows and
              some web sites.

       iso-2022-jp
              This is a Japanese character encoding format. It is a 7-bit encoding format.

       iso-2022-jp-3
              This is a Japanese character encoding format. It is a 7-bit encoding format. It is based upon
              the JIS X 0213 standard.

       euc-jp This is a Japanese character encoding format. It is an 8-bit encoding format.  Mainly used in UNIX
              systems.

       euc-jp-3
              The official name is EUC-JISX0213 - I just could not read this.  This is a Japanese character
              encoding format. It is an 8-bit encoding format. It is based upon the JIS X 0213 standard.

       shift-jis
              This  is  a  Japanese  character  encoding format.  It is an 8-bit encoding format. Mainly used in
              MSDOS/Windows.

       shift-jis-3
              The official name is Shift_JISX0213 - I just could not read this.  This is  a  Japanese  character
              encoding format.  It is an 8-bit encoding format. Mainly used in MSDOS/Windows.

       iso-2022-jp
              This is a Japanese 7-bit character encoding format.  Email messages in iso-2022-jp can be
              decoded/encoded in this format.

       iso-2022-x11
              This  is a Japanese character encoding format.  It is also known as "COMPOUND_TEXT"  encoding  for
              the  X   Window  System.  This is a 7-bit encoding format.  It can be derived from the ISO 2022-JP
              format with some differences.

       ksc-5601-x11
              This is a Korean character encoding format used by the X Window System (COMPOUND_TEXT encoding)
              to encode Korean (KS X 1001) and US-ASCII. This is a 7-bit encoding format compliant with the
              ISO-2022 specification for encoding of multiple character sets.  Please note that this is
              DIFFERENT from ISO-2022-KR (defined in IETF RFC 1557).

       euc-kr This is an 8-bit multibyte encoding for Korean.  It encodes US-ASCII (7-bit) in the single-byte
              range and characters in KS X 1001 (formerly KS C 5601) in the double-byte range with the MSB on
              (8-bit). It is used on Unix and the Internet. Korean versions of MS-DOS, MacOS and MS-Windows use
              a compatible (in most cases identical) variant of this encoding.

       johab  This is a Korean encoding specified in KS X 1001 (KS C 5601-1992), Annex 3, as a supplementary
              encoding.  Widely used in Korean MS-DOS until the mid-1990s.  It can encode all Hangul syllables
              (11,172) of modern Korean as well as all the special symbols and Hanja (Chinese ideograms used in
              Korea) defined in KS X 1001.

       uhc    A variant of EUC-KR used in Korean MS-Windows 95/98 (a proprietary encoding of Microsoft, CP949).
              Its character repertoire includes all modern syllables of Hangul, the Korean script, as well as
              all the special symbols and Hanja (Chinese ideograms used in Korea) defined in KS X 1001.

       gb-18030
              This is a  Chinese  character  encoding  format  based  upon  GB  18030.   It  encodes  the  whole
              U+0000..U+10FFFF range, while being compatible with gb-2312.
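
              The compatibility claim can be illustrated with Python's own gb2312 and gb18030 codecs.  These
              are stand-ins for illustration and may differ in detail from Yudit's map files; the helper name
              gb_bytes is hypothetical.

```python
def gb_bytes(ch):
    """Encode one character with Python's gb2312 and gb18030 codecs.

    Returns (gb2312 bytes, or None if not representable; gb18030 bytes).
    """
    try:
        legacy = ch.encode("gb2312")
    except UnicodeEncodeError:
        legacy = None  # the character is outside the gb-2312 repertoire
    return legacy, ch.encode("gb18030")


# A character covered by gb-2312 gets the same bytes in both encodings.
legacy, full = gb_bytes("\u4e2d")
print(legacy == full)
```

              A character outside the BMP, such as U+1F600, has no gb-2312 form in this sketch but still
              encodes in gb-18030.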

       gb-2312-x11
              This is a Chinese character encoding format based upon GB 2312.  It is a 7-bit encoding format.

       gb-2312
              This is a Chinese character encoding format based upon GB 2312.  It is an 8-bit encoding format.

       big-5  This  is  a  Chinese  character encoding format based upon BIG5 encoding.  It is an 8-bit encoding
              format.

       hz     This is a Chinese character encoding format based upon "Hanzi" encoding.  It is a  7-bit  encoding
              format.

       viscii This is a Vietnamese character encoding format.

       ucs-2-be
              This converts 16-bit unicode (ucs-2) streams. This format handles the big-endian variant.  Yudit
              does not recommend this format.

       ucs-2-le
              This converts 16-bit unicode (ucs-2) streams. This format handles the little-endian variant.
              Yudit does not recommend this format.

       ucs-2  This converts 16-bit unicode (ucs-2) streams.  The input byte order is recognized from the first
              two bytes, the BOM (byte-order mark) U+FEFF.  Yudit does not recommend this format.

       java   This converts \uxxxx character escapes. When encoding, all characters above U+0080 will be escaped
              with a string like '\u0080'.  When decoding, the same format is decoded but, in addition, utf-8
              format is also recognized, so it can also be used to recover data accidentally saved with the
              wrong encoding. The U+10000..U+10FFFF area is converted to surrogates and vice versa.

       java-s This converts \uxxxx character escapes. When encoding, all characters above U+0080 will be escaped
              with a string like '\u0080'. When decoding, the same format is decoded but, in addition, utf-8
              format is also recognized, so it can also be used to recover data accidentally saved with the
              wrong encoding. Surrogates are not treated specially during conversion - this is why it is not a
              recommended conversion.
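
              The encoding direction of the java converter can be sketched in Python as follows.  This is a
              minimal sketch and not Yudit's code: the function name java_escape is hypothetical, and it
              assumes that U+0080 itself is escaped and that lowercase hexadecimal digits are emitted.

```python
def java_escape(text):
    """Escape non-ASCII characters as \\uxxxx, using surrogates above the BMP."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)  # 7-bit ASCII passes through unchanged
        elif cp < 0x10000:
            out.append("\\u%04x" % cp)
        else:
            # U+10000..U+10FFFF becomes a high/low surrogate pair.
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            out.append("\\u%04x\\u%04x" % (high, low))
    return "".join(out)
```

              A java-s style encoder would skip the surrogate step, which is the difference the two entries
              above describe.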

FILES

       ~/.yudit/yudit.properties or /usr/share/yudit/config/yudit.properties
              can   have   yudit.datapath  property.  This  is  where  the  map  files  are  kept.   By  default
              /usr/share/yudit/data is searched.

SEE ALSO

       makeumap

AUTHOR

       This program  was written by gsinai@yudit.org (Gaspar Sinai), Tokyo, 2 January, 2001.