Provided by: kcc_2.3-12.1build1_amd64

NAME

       kcc - Kanji code converter with encoding auto detection

SYNOPSIS

       kcc [ -IOchnvxz ] [ -b bufsize ] [ file ] ...

DESCRIPTION

       kcc is a filter that reads each file sequentially, converts its kanji encoding, and writes the result to
       stdout.  If no file is specified, or if - is given as a file name, it reads from stdin.  You can specify
       the kanji encodings for input and output.  If you do not specify the input encoding, kcc detects it
       automatically.

       Available kanji encodings are JIS (7 bit and/or 8 bit), Shift JIS, EUC, and DEC.  Input may mix 7 bit
       JIS with one of EUC, DEC, or Shift JIS.  SI/SO and ESC(I are recognized as designating JIS halfwidth
       kana.

OPTIONS

       -O
       -IO    I is the input kanji encoding and O is the output kanji encoding.  When no input encoding is
              specified, it is detected automatically; if neither the input nor the output encoding is
              specified, the output encoding is 7 bit JIS.

              You can specify one of the following for the input encoding option, I.

                 e      EUC (may be mixed with 7 bit JIS)
                 d      DEC (may be mixed with 7 bit JIS)
                 s      Shift JIS (may be mixed with 7 bit JIS)
                 j, 7 or k
                        7 bit JIS
                 8      8 bit JIS

              You can specify one of the following for the output encoding option, O.

                 e      EUC
                 d      DEC
                 s      Shift JIS
                 jXY or 7XY
                        7 bit JIS (using SI/SO for JIS kana designation)
                 kXY    7 bit JIS (using ESC(I for JIS kana designation)
                 8XY    8 bit JIS

              The XY suffix of the O option selects which escape sequences are used in JIS output.  BJ is
              the default.  The supplemental kanji designation is fixed to ESC$(D.

                 X      Kanji is designated by:
                      B      ESC$B (JIS X0208-1983)
                      @      ESC$@ (JIS X0208-1978)
                      +      ESC&@ESC$B (JIS X0208-1990)
                 Y      Alpha Numerical is designated by:
                      B      ESC(B (ASCII)
                      J      ESC(J (JIS Roman; JIS X0201)
                      H      ESC(H (Swedish; strongly deprecated)

       -v     Output the result of input encoding detection to stderr.

       -x     Extension mode.  During auto detection of the input encoding, recognize user-defined characters
              and extended character regions (the area outside the standard EUC range, undefined halfwidth
              kana, the C1 control character area, and the extended character region of Shift JIS).  DEC and
              EUC are distinguished from each other in this mode.

       -z     Shrink mode.  Do not recognize halfwidth kana (except in 7 bit JIS) during input encoding
              detection.  With this option, auto detection of the input encoding becomes much more accurate
              for files without halfwidth kana.

       -h     Normally, halfwidth kana become fullwidth Katakana when converted to DEC.  With this option,
              they become Hiragana.

       -n     User-defined characters, extended characters, and supplemental kanji characters are converted
              to a fullwidth white box, and the undefined region of halfwidth kana is converted to a
              halfwidth centered dot.

       -b bufsize
              Specify the buffer size.  The default is 8 kbytes.

       -c     Do not convert; check the input encoding and print the result to stdout.  Unlike normal
              auto-detection, the whole contents of the file are checked.  However, when an inconsistency
              of encodings is found, reading is aborted and "data" is printed.  Options other than -x and
              -z are ignored.
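
       The X and Y designations listed under the O option can be written out as byte strings.  The following
       is a minimal Python sketch, not part of kcc; the names KANJI_DESIGNATION, ALNUM_DESIGNATION and
       designations are hypothetical and exist only for illustration.  The byte values are taken from the
       tables above.

```python
from typing import Tuple

ESC = b"\x1b"

# X: how kanji is designated (values copied from the X table of the O option).
KANJI_DESIGNATION = {
    "B": ESC + b"$B",
    "@": ESC + b"$@",
    "+": ESC + b"&@" + ESC + b"$B",
}

# Y: how alphanumerics are designated (values copied from the Y table).
ALNUM_DESIGNATION = {
    "B": ESC + b"(B",
    "J": ESC + b"(J",
    "H": ESC + b"(H",
}


def designations(xy: str = "BJ") -> Tuple[bytes, bytes]:
    """Return the (kanji, alphanumeric) escape sequences for an XY suffix."""
    x, y = xy
    return KANJI_DESIGNATION[x], ALNUM_DESIGNATION[y]


# The documented default suffix is BJ.
print(designations())  # (b'\x1b$B', b'\x1b(J')

# Python's own iso2022_jp codec happens to use the sequences that kcc calls BB.
encoded = "漢字".encode("iso2022_jp")
kanji_esc, ascii_esc = designations("BB")
print(encoded.startswith(kanji_esc), encoded.endswith(ascii_esc))
```

       The comparison with Python's iso2022_jp codec is only a sanity check of the byte values; it says
       nothing about how kcc itself emits them.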

EXAMPLES

       % kcc -e file
              The input encoding is detected automatically, and the output is in EUC encoding.

       % kcc -sj file1 file2
              Two files in Shift JIS are concatenated and converted to JIS.

       % command | kcc -k+J
              Output of command is converted to 7 bit JIS (kanji designated by ESC&@ESC$B, alphanumerics
              as JIS Roman, and halfwidth kana designated by ESC(I).

       % kcc -c file
              The encoding of the contents of file is detected (no conversion).

BUG

       Auto detection of the input encoding works well in normal cases; however, it has the following problems.

       7 bit JIS can be recognized with certainty by its escape sequences.  EUC and DEC are treated as the
       same (referred to as the EUC series).  Halfwidth kana of 8 bit JIS are the same as halfwidth kana of
       Shift JIS (referred to as the Shift JIS series).  However, the EUC series and the Shift JIS series,
       which are both 8 bit encodings, share wide regions of byte values.  So the problem in auto detection
       is distinguishing between these two.

       Detection of EUC series/Shift JIS series is done line by line.  The encoding is determined as soon as
       the input is found not to be Shift JIS series, or not to be EUC series.  When an inconsistency is
       found, the input is treated as "data" and the contents of the output are not guaranteed.

       After an 8 bit code is found and until EUC series/Shift JIS series is determined, conversion is
       deferred and input data is held in a buffer.  If the buffer becomes full, kcc assumes the EUC series
       and forces conversion to start.  The rationale is that documents containing kanji can usually be
       assumed to include JIS non-kanji characters or JIS first standard kanji, which in Shift JIS fall in
       a region not shared with EUC, so Shift JIS can be detected with certainty.  If the encoding cannot be
       determined, it is therefore very likely to be EUC.
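
       The line-by-line elimination and the fallback to EUC described above can be sketched in Python.  This
       is only an illustration under stated assumptions, not kcc's implementation: it uses Python's euc_jp
       and shift_jis codecs in place of kcc's own byte-range checks, the function name detect_8bit is
       hypothetical, and treating end of input like a full buffer is an assumption.

```python
def detect_8bit(data: bytes, bufsize: int = 8192) -> str:
    """Guess 'euc_jp', 'shift_jis', 'data' or 'ascii' by elimination.

    Sketch only: Python codecs stand in for kcc's byte-range checks.
    """
    candidates = {"euc_jp", "shift_jis"}
    pending = 0  # bytes of undecided 8 bit input held back so far

    for line in data.splitlines(keepends=True):
        if line.isascii():
            continue  # nothing to learn from a 7 bit line

        # Drop every candidate that cannot decode this line.
        for codec in sorted(candidates):
            try:
                line.decode(codec)
            except UnicodeDecodeError:
                candidates.discard(codec)

        if not candidates:
            return "data"  # inconsistent: neither series fits
        if len(candidates) == 1:
            return candidates.pop()  # determined by elimination

        pending += len(line)
        if pending >= bufsize:
            break  # buffer full: stop waiting for evidence

    # Still undecided: assume EUC (assumption: same rule at end of input).
    return "euc_jp" if pending else "ascii"


print(detect_8bit("日本語\n".encode("shift_jis")))
print(detect_8bit("日本語\n".encode("euc_jp")))
print(detect_8bit("かな\n".encode("euc_jp")))
```

       In the last call the EUC bytes for hiragana are also valid Shift JIS halfwidth kana, so the sketch
       cannot eliminate either series and falls back to EUC.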

       If the input is 8 bit JIS and its halfwidth kana always appear in sequences of even length, it will
       be wrongly detected as EUC kanji.  Be careful.
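
       The even-length case can be reproduced with Python's codecs.  This is a small illustration that does
       not involve kcc itself; the helper decodes_as_euc is a hypothetical name.  Two halfwidth kana bytes
       form a valid two-byte EUC kanji, while an odd run leaves a dangling byte that EUC rejects.

```python
pair = b"\xb1\xb2"  # two halfwidth kana bytes as used by Shift JIS / 8 bit JIS

as_sjis = pair.decode("shift_jis")  # read as two halfwidth katakana
as_euc = pair.decode("euc_jp")      # read as a single two-byte kanji


def decodes_as_euc(data: bytes) -> bool:
    """Return True if the bytes form a valid EUC-JP sequence."""
    try:
        data.decode("euc_jp")
        return True
    except UnicodeDecodeError:
        return False


print(len(as_sjis), len(as_euc))     # 2 1
print(decodes_as_euc(pair))          # True: an even run looks like EUC kanji
print(decodes_as_euc(pair + b"\xb3"))  # False: an odd run cannot be EUC
```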

       If the input has no halfwidth kana, use -z and detection becomes much more accurate.  This is because
       the shared region is then restricted to the area of JIS second standard kanji.

       The extended region of Shift JIS, the user-defined area of EUC, the C1 control characters of EUC, and
       the undefined region of halfwidth kana of EUC are outside the range of auto detection, so detection
       fails if the input contains these characters.  Use the -x option to select extension mode, or specify
       the input encoding.

SEE ALSO

       cat(1)

NOTES

       Usually, user-defined characters, extended characters, and supplemental kanji characters are mapped to
       their respective counterparts.  However, characters outside the range of extended characters become
       FCFC in hexadecimal when converted to Shift JIS.  Although the control character region C1 of EUC and
       DEC is preserved when converted to JIS, it is deleted when converted to Shift JIS.  The undefined area
       of halfwidth kana becomes a halfwidth centered dot when converted to Shift JIS.  Halfwidth kana become
       fullwidth kana when converted to DEC.
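
       The halfwidth-to-fullwidth kana mapping, and the Hiragana variant selected by -h, can be illustrated
       at the Unicode level.  The sketch below is an assumption-laden stand-in: it uses Unicode NFKC
       normalization and a fixed code point offset rather than kcc's own conversion tables, and both
       function names are hypothetical.

```python
import re
import unicodedata

# Runs of halfwidth katakana and halfwidth CJK punctuation (U+FF61..U+FF9F).
_HALFWIDTH_RUN = re.compile("[\uff61-\uff9f]+")


def halfwidth_to_fullwidth(text: str) -> str:
    """Replace halfwidth kana runs with their fullwidth equivalents (NFKC)."""
    return _HALFWIDTH_RUN.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), text
    )


def katakana_to_hiragana(text: str) -> str:
    """Shift fullwidth Katakana (U+30A1..U+30F6) down to Hiragana."""
    return "".join(
        chr(ord(c) - 0x60) if "\u30a1" <= c <= "\u30f6" else c for c in text
    )


full = halfwidth_to_fullwidth("\uff71\uff72")  # halfwidth A, I
print(full)                        # fullwidth Katakana
print(katakana_to_hiragana(full))  # the same syllables as Hiragana
```

       Normalization is applied only to halfwidth runs so that other characters in the text are left
       untouched.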

       When the output encoding is JIS, control characters such as newline, TAB, and DEL, as well as
       (halfwidth) white space, are output in ASCII mode.

       When the input encoding is detected wrongly, or the input contains characters undefined in the expected
       character sets, the output is undefined.

       This manual was translated by Fumitoshi UKAI <ukai@debian.or.jp> for the Debian system, but you can
       use it for any purpose.