Provided by: yudit_3.1.0-1_amd64

NAME

       uniconv - convert text to native formats through Unicode

SYNOPSIS

       uniconv [ -out output-file ] [ -decode input-encoding ] [ -encode output-encoding ]
       [ input-file ] [ -todos ] [ -fromdos ] [ -tomac ] [ -frommac ]

DESCRIPTION

       The uniconv program decodes scripts in one encoding and encodes them in another
       encoding. The script is a 16-, 8- or 7-bit byte stream. The converted text is sent to
       the standard output, even for 16-bit encoding methods, unless an output file is
       specified with the -out option.

       The -decode and -encode options are optional; the default converter is utf-8. The program
       reads the Unicode map helper files (*.my) from the default directory /usr/share/data.
       Simple 1-to-1 encoding methods can be added on the fly by adding a my-file, or by setting
       the yudit.datapath property in ~/.yudit/yudit.properties or
       /usr/share/yudit/config/yudit.properties. By default /usr/share/yudit/data and
       ~/.yudit/data are searched.

       My-files can be created by a program called makeumap. The files can be converted between
       dos/unix/mac line-ending variants with the -fromdos, -frommac, -todos and -tomac options.
       The default (when none is specified) is Unix.
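
       The line-ending options can be pictured with a minimal Python sketch. This is not
       uniconv's implementation: the function name and the mode strings are hypothetical and
       only mirror the option names, and "mac" is assumed to mean the classic Mac convention
       of a bare carriage return.

```python
# Hypothetical sketch of the line-ending conversions; not uniconv's code.
def convert_line_endings(text: str, mode: str) -> str:
    """Convert line endings; `mode` mirrors the uniconv option names."""
    if mode == "fromdos":      # DOS (CR LF) -> Unix (LF)
        return text.replace("\r\n", "\n")
    if mode == "frommac":      # classic Mac (CR, assumed) -> Unix (LF)
        return text.replace("\r", "\n")
    if mode == "todos":        # Unix (LF) -> DOS (CR LF)
        return text.replace("\n", "\r\n")
    if mode == "tomac":        # Unix (LF) -> classic Mac (CR, assumed)
        return text.replace("\n", "\r")
    raise ValueError(f"unknown mode: {mode}")
```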

ENCODING

       If you received this program through the Yudit distribution, then  as  of  today  you  can
       convert between the encoding methods below.

       utf-8  Yudit  recommends  this  format for international information exchange.  ASCII text
              will  get through  intact, while other Unicode characters will get  their  8th  bit
              set  and  the  length   of   the  code  will depend on how far away they are in the
              Unicode space.  This is the only transformation format that can encode both  16-bit
              (ucs-2) and 31-bit (ucs-4) Unicode.
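
              The length rule can be checked with a short Python sketch. It uses Python's
              built-in utf-8 codec rather than uniconv, and that codec stops at U+10FFFF, so
              the longer sequences of the original 31-bit definition are not shown. The sample
              characters are arbitrary.

```python
# Encoded length in UTF-8 grows with the code point value.
samples = ["A", "\u00e9", "\u20ac", "\U0001f600"]
lengths = {f"U+{ord(ch):04X}": len(ch.encode("utf-8")) for ch in samples}

# ASCII passes through unchanged; every byte of a non-ASCII
# character has its 8th (high) bit set.
ascii_intact = "A".encode("utf-8") == b"A"
high_bits_set = all(b & 0x80 for b in "\u20ac".encode("utf-8"))

print(lengths, ascii_intact, high_bits_set)
```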

       utf-8-s
              Hacker's utf-8 format: it does not give an error message when a surrogate pair is
              decoded, and it can encode a surrogate pair 'as is'. This is not a recommended
              encoding format, although it is used to encode/decode clipboard data in order to
              preserve input.

       utf-16 Although 16 is bigger than 8, this is still a compromise required by OSes like
              Windows that cannot handle ucs-4. This encoding produces 16-bit Unicode streams.
              In addition to the BMP, it can convert 16 planes using the Unicode Surrogate Area.
              This encoding cannot convert anything above U+10FFFF (Plane 16). The input byte
              order is recognized from the BOM (byte-order mark) U+FEFF at the start of the
              stream. This format is used in Windows NT for documents like notepad .txt files.
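
              The surrogate arithmetic and the byte-order check can be sketched in Python. The
              function names are hypothetical and this is only an illustration of the standard
              UTF-16 rules, not uniconv's code.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a code point in U+10000..U+10FFFF into a surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is outside the surrogate-encoded range")
    v = cp - 0x10000                      # 20-bit offset above the BMP
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def byte_order_from_bom(data: bytes) -> str:
    """Guess the byte order from a leading U+FEFF byte-order mark."""
    if data[:2] == b"\xfe\xff":
        return "big"
    if data[:2] == b"\xff\xfe":
        return "little"
    return "unknown"                      # no BOM present
```

              For example, to_surrogates(0x1F600) gives (0xD83D, 0xDE00), the same two code
              units that Python's own utf-16 codecs produce for that character.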

       utf-16-be
              Big endian utf-16 converter.

       utf-16-le
              Little endian utf-16 converter.

       utf-7  This is the recommended format for international information exchange when only
              7-bit data can be used. It can only handle 16-bit (utf-16) Unicode; for ucs-4
              (above U+10FFFF) you should use utf-8 encoding.
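
              The 7-bit property can be demonstrated with Python's built-in utf-7 codec, used
              here as a stand-in for uniconv's converter. The sample text is arbitrary, and the
              exact bytes produced may differ between implementations, so only the 7-bit
              property and the round trip are checked.

```python
text = "Gr\u00fc\u00dfe, \u4e16\u754c"      # Latin-1 and CJK characters
encoded = text.encode("utf-7")

# Every output byte fits in 7 bits, so the data survives 7-bit channels.
seven_bit_only = all(b < 0x80 for b in encoded)

# Decoding restores the original text.
round_trip = encoded.decode("utf-7") == text

print(encoded, seven_bit_only, round_trip)
```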

       iso-8859-1
              This  is  the  ISO 8859-1 character  encoding format. It is also known as "Latin-1"
              encoding.

       iso-8859-2
              This  is  the ISO 8859-2 character encoding format. It is also  known  as  "Central
              European" encoding.

       iso-8859-5
              This  is  the  ISO 8859-5 character encoding format. It is also known as "Cyrillic"
              encoding.

       iso-8859-7
              This is the ISO 8859-7 character encoding format.  It  is  also  known  as  "Greek"
              encoding.

       iso-8859-9
              This  is  the  ISO  8859-9 character encoding format. It is also known as "Turkish"
              encoding.

       koi8-r This is the KOI8-R character encoding format. It is mainly used in Russia.

       cp-1251
              This is the CP1251 Cyrillic character encoding format. It is mainly used in
              Microsoft Windows and on some web sites.

       iso-2022-jp
              This is a Japanese character encoding format. It is a 7-bit encoding format.

       iso-2022-jp-3
              This is a Japanese character encoding format. It is a 7-bit encoding format. It is
              based upon the JIS X 0213 standard.

       euc-jp This is a Japanese character encoding format.  It  is  an  8-bit  encoding  format.
              Mainly used in UNIX systems.

       euc-jp-3
              The official name is EUC-JISX0213 - I just could not read this. This is a Japanese
              character encoding format. It is an 8-bit encoding format. It is based upon the
              JIS X 0213 standard.

       shift-jis
              This  is  a  Japanese  character  encoding format.  It is an 8-bit encoding format.
              Mainly used in MSDOS/Windows.

       shift-jis-3
              The official name is Shift_JISX0213 - I just  could  not  read  this.   This  is  a
              Japanese character encoding format.  It is an 8-bit encoding format. Mainly used in
              MSDOS/Windows.

       iso-2022-jp
              This is a Japanese 7-bit character encoding format. Email messages in iso-2022-jp
              use this format and can be decoded/encoded with it.

       iso-2022-x11
              This  is a Japanese character encoding format.  It is also known as "COMPOUND_TEXT"
              encoding for the X  Window System. This is a 7-bit  encoding  format.   It  can  be
              derived from the ISO 2022-JP format with some differences.

       ksc-5601-x11
              This is a Korean character encoding format used by the X Window System
              (COMPOUND_TEXT encoding) to encode Korean (KS X 1001) and US-ASCII. This is a
              7-bit encoding format compliant with the ISO-2022 specification for encoding
              multiple character sets. Please note that this is DIFFERENT from ISO-2022-KR
              (defined in IETF RFC 1557).

       euc-kr This is an 8-bit multibyte encoding for Korean. It encodes US-ASCII (7-bit) in
              the single-byte range and characters in KS X 1001 (formerly KS C 5601) in the
              double-byte range with the MSB on (8-bit). It is used in Unix and on the Internet.
              Korean versions of MS-DOS, MacOS and MS-Windows use a compatible (in most cases
              identical) variant of this encoding.
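
              The single-byte and double-byte ranges can be seen with Python's euc_kr codec,
              used here as a stand-in for uniconv. The sample string, an ASCII letter followed
              by the Hangul syllable U+AC00, is arbitrary.

```python
encoded = "A\uac00".encode("euc_kr")

# The ASCII letter stays a single byte with the high bit clear.
ascii_byte = encoded[0]

# The Hangul syllable becomes two bytes, both with the MSB set.
hangul_bytes = encoded[1:]
msb_set = all(b & 0x80 for b in hangul_bytes)

print(encoded, msb_set)
```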

       johab  This is a Korean encoding specified in KS X 1001 (KS C 5601-1992), Annex 3, as a
              supplementary encoding. It was widely used in Korean MS-DOS until the mid-1990s.
              It can encode all Hangul syllables (11,172) of modern Korean as well as all the
              special symbols and Hanja (Chinese ideograms used in Korea) defined in KS X 1001.

       uhc    A variant of EUC-KR used in Korean MS-Windows 95/98 (a proprietary encoding of
              Microsoft, CP949). Its character repertoire includes all modern syllables of
              Hangul, the Korean script, as well as all the special symbols and Hanja (Chinese
              ideograms used in Korea) defined in KS X 1001.

       gb-18030
              This  is  a  Chinese character encoding format based upon GB 18030.  It encodes the
              whole U+0000..U+10FFFF range, while being compatible with gb-2312.
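
              Both properties can be checked with Python's gb18030 and gb2312 codecs, used
              here as a stand-in for uniconv. The sample characters are arbitrary.

```python
# ASCII is a single byte in gb-18030.
ascii_len = len("A".encode("gb18030"))

# A character that exists in gb-2312 keeps the same two bytes.
han = "\u4e2d"
same_as_gb2312 = han.encode("gb18030") == han.encode("gb2312")

# A character outside the BMP is still encodable, as four bytes.
supplementary_len = len("\U0001f600".encode("gb18030"))

print(ascii_len, same_as_gb2312, supplementary_len)
```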

       gb-2312-x11
              This is a Chinese character encoding format based upon GB  2312.   It  is  a  7-bit
              encoding format.

       gb-2312
              This  is  a  Chinese  character encoding format based upon GB 2312.  It is an 8-bit
              encoding format.

       big-5  This is a Chinese character encoding format based upon BIG5  encoding.   It  is  an
              8-bit encoding format.

       hz     This  is  a Chinese character encoding format based upon "Hanzi" encoding.  It is a
              7-bit encoding format.

       viscii This is a Vietnamese character encoding format.

       ucs-2-be
              This converts 16-bit Unicode (ucs-2) streams. The format handles the big-endian
              variant. Yudit does not recommend this format.

       ucs-2-le
              This converts 16-bit Unicode (ucs-2) streams. The format handles the little-endian
              variant. Yudit does not recommend this format.

       ucs-2  This converts 16-bit Unicode (ucs-2) streams. The input byte order is recognized
              from the BOM (byte-order mark) U+FEFF at the start of the stream. Yudit does not
              recommend this format.

       java   This converts \uxxxx character escapes. When encoding, all characters above U+0080
              will be escaped with a string like '\u0080'. When decoding, the same escapes are
              decoded and, in addition, utf-8 input is recognized, so it can also be used to
              recover data accidentally saved with the wrong encoding. The U+10000..U+10FFFF
              area is converted to surrogates and vice versa.
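
              The encoding direction can be sketched in Python. The function name is
              hypothetical and this is not uniconv's code. It assumes that U+0080 itself is
              escaped (as the '\u0080' example suggests) and that hex digits are written in
              lower case.

```python
def java_escape(text: str) -> str:
    """Escape non-ASCII characters as \\uxxxx, using surrogate pairs above the BMP."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:                     # ASCII passes through unchanged
            out.append(ch)
        elif cp < 0x10000:                # BMP character: one escape
            out.append(f"\\u{cp:04x}")
        else:                             # U+10000..U+10FFFF: two escapes
            v = cp - 0x10000
            hi = 0xD800 + (v >> 10)
            lo = 0xDC00 + (v & 0x3FF)
            out.append(f"\\u{hi:04x}\\u{lo:04x}")
    return "".join(out)
```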

       java-s This converts \uxxxx character escapes. When encoding, all characters above U+0080
              will be escaped with a string like '\u0080'. When decoding, the same escapes are
              decoded and, in addition, utf-8 input is recognized, so it can also be used to
              recover data accidentally saved with the wrong encoding. Surrogates are not
              treated specially during conversion, which is why this is not a recommended
              conversion.

FILES

       ~/.yudit/yudit.properties or /usr/share/yudit/config/yudit.properties
              can have a yudit.datapath property. This is where the map files are kept. By
              default /usr/share/yudit/data is searched.

SEE ALSO

        makeumap

AUTHOR

       This program was written by gaspar@yudit.org (Gaspar Sinai). Last updated: 5 February,
       2023, Tokyo.