Provided by: yudit_2.9.6-5_amd64

NAME

       uniconv - convert text to native formats through unicode

SYNOPSIS

       uniconv  -out  output-file [ -decode input-encoding ] [ -encode output-encoding ] [ input-
       file ] [ -todos ] [ -fromdos ] [ -tomac ] [ -frommac ]

DESCRIPTION

       The uniconv program decodes scripts in one encoding and encodes them in another
       encoding. The script is a 16-, 8-, or 7-bit byte stream. The converted text is sent to
       the standard output, even for 16-bit encodings, unless an output file is specified
       with the -out option.

       The -decode and -encode options are optional; the default converter is utf-8. The program
       reads the Unicode map helper files (*.my) from the default directory /usr/share/data.
       Simple 1-to-1 encodings can be added on the fly by adding a my-file, or by setting the
       yudit.datapath property in ~/.yudit/yudit.properties or
       /usr/share/yudit/config/yudit.properties. By default /usr/share/yudit/data is searched.

       My-files can be created by a program called makeumap. The files can be converted between
       dos/unix/mac line-ending variants with the -fromdos, -frommac, -todos, and -tomac
       options. The default (when none is specified) is Unix.
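
       As a rough illustration of what the line-ending options do, the following Python sketch
       normalizes a string to one style. It is not uniconv's implementation; the names
       convert_line_endings and LINE_ENDINGS are hypothetical, and it assumes that "mac" means
       a lone carriage return.

```python
# Illustrative sketch only; uniconv's own code is not described in this page.
# Assumption: "dos" = CR LF, "mac" = lone CR, "unix" = LF.
LINE_ENDINGS = {"unix": "\n", "dos": "\r\n", "mac": "\r"}


def convert_line_endings(text, target="unix"):
    """Normalize any mix of DOS, Mac and Unix line endings to one style."""
    # Collapse CR LF first so the lone-CR replacement does not double up.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", LINE_ENDINGS[target])


sample = "one\r\ntwo\rthree\n"
print(repr(convert_line_endings(sample, "dos")))
```

       Normalizing to a single intermediate form first keeps the conversion independent of
       the input style, which mirrors how the options can be combined on one command line.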

ENCODING

       If  you  received  this  program  through the Yudit distribution, then as of today you can
       convert between the encodings below.

       utf-8  Yudit recommends this format for international information  exchange.   ASCII  text
              will   get  through   intact, while other unicode characters will get their 8th bit
              set and the length  of  the  code  will depend on how far  away  they  are  in  the
              Unicode  space.  This is the only transformation format that can encode both 16-bit
              (ucs-2) and 31-bit (ucs-4) unicode.
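
              The relationship between code point and encoded length can be checked with a
              short Python sketch. It uses Python's built-in utf-8 codec rather than uniconv,
              and the helper name utf8_length is hypothetical. Python's codec stops at
              U+10FFFF, so the full 31-bit range is not exercised here.

```python
# Illustrative sketch using Python's built-in utf-8 codec, not uniconv itself.
def utf8_length(code_point):
    """Return how many bytes the utf-8 codec uses for one code point."""
    return len(chr(code_point).encode("utf-8"))


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    encoded = chr(cp).encode("utf-8")
    # ASCII bytes keep the 8th bit clear; every byte of a longer sequence has it set.
    high_bit_everywhere = all(byte & 0x80 for byte in encoded)
    print(f"U+{cp:04X}: {len(encoded)} byte(s), 8th bit set on all bytes: {high_bit_everywhere}")
```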

       utf-8-s
              Hacker's utf-8 format. It does not give an error message when a surrogate pair is
              decoded, and it can encode a surrogate pair 'as is'. This is not a recommended
              encoding format, although it is used to encode/decode clipboard data in
              order to preserve input.

       utf-16 Although 16 is bigger than 8, this is still a compromise required by OSes like
              Windows that cannot handle ucs-4. This encoding produces 16-bit Unicode streams.
              In addition to the BMP it can convert 16 planes using the Unicode Surrogate Area.
              This encoding cannot convert anything above U+10FFFF (Plane 16). The input byte
              order is recognized from the first two bytes, the BOM (byte-order mark) U+FEFF.
              This format is used in Windows NT for documents like Notepad .txt files.
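
              The surrogate arithmetic and the byte-order check can be sketched in Python as
              follows. This is a minimal illustration of the standard UTF-16 rules, not
              uniconv's code; the function names to_surrogates and detect_byte_order are
              hypothetical.

```python
# Illustrative sketch of standard UTF-16 rules, not uniconv's implementation.
import codecs


def to_surrogates(code_point):
    """Split a code point in U+10000..U+10FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the range covered by surrogate pairs")
    offset = code_point - 0x10000
    # The top 10 bits go to the high surrogate, the bottom 10 bits to the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def detect_byte_order(data):
    """Guess the byte order of a utf-16 stream from its first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    return "unknown"


high, low = to_surrogates(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")
print(detect_byte_order("A".encode("utf-16-be").join([codecs.BOM_UTF16_BE, b""])))
```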

       utf-16-be
              Big endian utf-16 converter.

       utf-16-le
              Little endian utf-16 converter.

       utf-7  This is the recommended format for international information exchange when only
              7-bit data can be used. It can only handle 16-bit (utf-16) Unicode; for ucs-4
              (above U+10FFFF) you should use utf-8 encoding.

       iso-8859-1
              This is the ISO 8859-1 character  encoding format. It is also  known  as  "Latin-1"
              encoding.

       iso-8859-2
              This   is   the  ISO 8859-2 character encoding format. It is also known as "Central
              European" encoding.

       iso-8859-5
              This is the ISO 8859-5 character encoding format. It is also  known  as  "Cyrillic"
              encoding.

       iso-8859-7
              This  is  the  ISO  8859-7  character  encoding format. It is also known as "Greek"
              encoding.

       iso-8859-9
              This is the ISO 8859-9 character encoding format. It is  also  known  as  "Turkish"
              encoding.

       koi8-r This is the KOI8-R character encoding format. It is mainly used in Russia.

       cp-1251
              This is the CP1251 Cyrillic character encoding format. It is mainly used in
              Microsoft Windows and some web sites.

       iso-2022-jp
              This is a Japanese character encoding format. It is a 7-bit encoding format.

       iso-2022-jp-3
              This is a Japanese character encoding format. It is a 7-bit encoding format. It is
              based upon the JIS X 0213 standard.

       euc-jp This  is  a  Japanese  character  encoding  format. It is an 8-bit encoding format.
              Mainly used in UNIX systems.

       euc-jp-3
              The official name is EUC-JISX0213 - I just could not read this.  This is a Japanese
              character encoding format. It is an 8-bit encoding format. It is based upon the
              JIS X 0213 standard.

       shift-jis
              This is a Japanese character encoding format.  It  is  an  8-bit  encoding  format.
              Mainly used in MSDOS/Windows.

       shift-jis-3
              The  official  name  is  Shift_JISX0213  -  I  just could not read this.  This is a
              Japanese character encoding format.  It is an 8-bit encoding format. Mainly used in
              MSDOS/Windows.

       iso-2022-jp
              This is a Japanese 7-bit character encoding format. Email messages in iso-2022-jp
              can be decoded and encoded with this format.

       iso-2022-x11
              This  is a Japanese character encoding format.  It is also known as "COMPOUND_TEXT"
              encoding  for  the  X   Window  System. This is a 7-bit encoding format.  It can be
              derived from the ISO 2022-JP format with some differences.

       ksc-5601-x11
              This is a Korean character encoding format used by the X Window System
              (COMPOUND_TEXT encoding) to encode Korean (KS X 1001) and US-ASCII. This is a
              7-bit encoding format compliant with the ISO-2022 specification for encoding
              multiple character sets. Please note that this is DIFFERENT from ISO-2022-KR
              (defined in IETF RFC 1557).

       euc-kr This is an 8-bit multibyte encoding for Korean. It encodes US-ASCII (7-bit) in
              the single-byte range and characters in KS X 1001 (formerly KS C 5601) in the
              double-byte range with the MSB on (8-bit). It is used on Unix and the Internet.
              Korean versions of MS-DOS, MacOS and MS-Windows use a compatible (in most cases,
              identical) variant of this encoding.

       johab  This is a Korean encoding specified in KS X 1001 (KS C 5601-1992), Annex 3, as a
              supplementary encoding. It was widely used in Korean MS-DOS until the mid-1990s.
              It can encode all Hangul syllables (11,172) of modern Korean as well as all the
              special symbols and Hanja (Chinese ideograms used in Korea) defined in KS X 1001.

       uhc    A variant of EUC-KR used in Korean MS-Windows 95/98 (a proprietary encoding of
              Microsoft, CP949). Its character repertoire includes all modern syllables of
              Hangul, the Korean script, as well as all the special symbols and Hanja (Chinese
              ideograms used in Korea) defined in KS X 1001.

       gb-18030
              This is a Chinese character encoding format based upon GB 18030.   It  encodes  the
              whole U+0000..U+10FFFF range, while being compatible with gb-2312.

       gb-2312-x11
              This  is  a  Chinese  character  encoding format based upon GB 2312.  It is a 7-bit
              encoding format.

       gb-2312
              This is a Chinese character encoding format based upon GB 2312.   It  is  an  8-bit
              encoding format.

       big-5  This  is  a  Chinese  character encoding format based upon BIG5 encoding.  It is an
              8-bit encoding format.

       hz     This is a Chinese character encoding format based upon "Hanzi" encoding.  It  is  a
              7-bit encoding format.

       viscii This is a Vietnamese character encoding format.

       ucs-2-be
              This converts 16-bit Unicode (ucs-2) streams. This format handles the big-endian
              variant. Yudit does not recommend this format.

       ucs-2-le
              This converts 16-bit Unicode (ucs-2) streams. This format handles the
              little-endian variant. Yudit does not recommend this format.

       ucs-2  This converts 16-bit Unicode (ucs-2) streams. The input byte order is recognized
              from the first two bytes, the BOM (byte-order mark) U+FEFF. Yudit does not
              recommend this format.

       java   This converts \uxxxx character escapes. When encoding, all characters above U+0080
              will be escaped with a string like '\u0080'. When decoding, the same format is
              decoded but, in addition, utf-8 format is also recognized, so it can also be used
              to recover data accidentally saved with the wrong encoding. The U+10000..U+10FFFF
              area is converted to surrogates and vice versa.

       java-s This converts \uxxxx character escapes. When encoding, all characters above U+0080
              will be escaped with a string like '\u0080'. When decoding, the same format is
              decoded but, in addition, utf-8 format is also recognized, so it can also be used
              to recover data accidentally saved with the wrong encoding. Surrogates are not
              treated specially during conversion; this is why it is not a recommended
              conversion.
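
       The encoding direction of the java format can be approximated with the Python sketch
       below. It is an illustration under assumptions, not uniconv's code: the function name
       java_escape is hypothetical, the escape boundary is taken to be U+0080 inclusive, and
       lowercase hex digits are an arbitrary choice. Code points in U+10000..U+10FFFF are
       written as a surrogate pair, as described for the java encoding.

```python
# Illustrative sketch only; the exact escape boundary and hex case used by
# uniconv are assumptions here.
def java_escape(text, threshold=0x80):
    """Escape every code point at or above `threshold` as \\uxxxx sequences."""
    pieces = []
    for ch in text:
        cp = ord(ch)
        if cp < threshold:
            pieces.append(ch)
        elif cp > 0xFFFF:
            # Split into a surrogate pair so each escape stays four hex digits.
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            pieces.append(f"\\u{high:04x}\\u{low:04x}")
        else:
            pieces.append(f"\\u{cp:04x}")
    return "".join(pieces)


print(java_escape("caf\u00e9 \U0001F600"))
```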

FILES

       ~/.yudit/yudit.properties or /usr/share/yudit/config/yudit.properties
              can have a yudit.datapath property. This is where the map files are kept. By
              default /usr/share/yudit/data is searched.

SEE ALSO

        makeumap

AUTHOR

       This program  was written by gsinai@yudit.org (Gaspar Sinai), Tokyo, 2 January, 2001.