Provided by: kcc_2.3+really-0.1_amd64

NAME

       kcc - Kanji code converter with encoding auto detection

SYNOPSIS

       kcc [ -IOchnvxz ] [ -b bufsize ] [ file ] ...

DESCRIPTION

       kcc is a filter that reads each file in sequence, converts its kanji encoding, and writes
       the result to stdout. If no file is specified, or if - is given as a filename, it reads
       from stdin. You can specify the kanji encodings for input and output. If you don't specify
       an input encoding, kcc detects it automatically.

       The available kanji encodings are JIS (7 bit and/or 8 bit), Shift JIS, EUC, and DEC. Input
       may mix 7 bit JIS with one of EUC, DEC, or Shift JIS. Both SI/SO and ESC(I are recognized
       as halfwidth kana designations in JIS.

OPTIONS

       -O
       -IO    I is the input kanji encoding and O is the output kanji encoding. When no input
              encoding is specified, it is detected automatically. If neither the input nor the
              output encoding is specified, the output encoding is 7 bit JIS.

              You can specify one of the following for the input encoding option, I.

                 e      EUC (may be mixed with 7 bit JIS)
                 d      DEC (may be mixed with 7 bit JIS)
                 s      Shift JIS (may be mixed with 7 bit JIS)
                 j, 7 or k
                        7 bit JIS
                 8      8 bit JIS

              You can specify one of the following for the output encoding option, O.

                 e      EUC
                 d      DEC
                 s      Shift JIS
                 jXY or 7XY
                        7 bit JIS (using SI/SO for JIS kana designation)
                 kXY    7 bit JIS (using ESC(I for JIS kana designation)
                 8XY    8 bit JIS

              With XY in the O option, you can specify which escape sequences are used in JIS
              encoding. The default is BJ. The supplemental kanji designation is fixed to ESC$(D.

                 X      Kanji is designated by:
                      B      ESC$B (JIS X0208-1983)
                      @      ESC$@ (JIS X0208-1978)
                      +      ESC&@ESC$B (JIS X0208-1990)
                 Y      Alphanumerics are designated by:
                      B      ESC(B (ASCII)
                      J      ESC(J (JIS Roman; JIS X0201)
                      H      ESC(H (Swedish; strongly deprecated)
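
              The X and Y tables above can be read as a lookup from an output option to the
              byte sequences it selects. The following Python sketch is only an illustration:
              the function name jis_escapes and the returned dictionary layout are hypothetical,
              and it assumes that XY is either omitted (falling back to the documented default
              BJ) or given in full. How kcc itself parses a partial XY is not stated here.

```python
ESC = b"\x1b"

# Kanji designations selected by X.
KANJI = {
    "B": ESC + b"$B",
    "@": ESC + b"$@",
    "+": ESC + b"&@" + ESC + b"$B",
}

# Alphanumeric designations selected by Y.
ALNUM = {
    "B": ESC + b"(B",  # ASCII
    "J": ESC + b"(J",  # JIS Roman
    "H": ESC + b"(H",  # Swedish, strongly deprecated
}


def jis_escapes(option):
    """Hypothetical helper: expand a JIS output option such as 'j', '7BB' or 'k+J'."""
    kind, xy = option[:1], option[1:] or "BJ"  # BJ is the documented default
    if kind not in ("j", "7", "k", "8") or len(xy) != 2:
        raise ValueError("expected j, 7, k or 8, optionally followed by XY")
    x, y = xy[0], xy[1]
    if x not in KANJI or y not in ALNUM:
        raise ValueError("unknown X or Y designation")
    if kind in ("j", "7"):
        kana = (b"\x0e", b"\x0f")  # SO switches to kana, SI switches back
    elif kind == "k":
        kana = (ESC + b"(I",)      # kana designated by an escape sequence
    else:
        kana = ()                  # 8 bit JIS: kana are sent as 8 bit bytes
    return {"kanji": KANJI[x], "alnum": ALNUM[y], "kana": kana}
```

              For example, jis_escapes("k+J") selects ESC&@ESC$B for kanji, ESC(J for
              alphanumerics and ESC(I for halfwidth kana.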

       -v     Output the result of input encoding detection to stderr.

       -x     Extension mode. During auto detection of the input encoding, also recognize
              user-defined characters and extended character regions (for EUC: out-of-range
              characters, undefined halfwidth kana, and the C1 control character area; for Shift
              JIS: the extended character region). DEC and EUC are distinguished in this mode.

       -z     Shrink mode. Don't recognize halfwidth kana (except in 7 bit JIS) during input
              encoding detection. With this option, auto detection of the input encoding becomes
              much more accurate for files without halfwidth kana.

       -h     Normally, halfwidth kana become fullwidth Katakana when converted to DEC. With this
              option, they become Hiragana.

       -n     User-defined characters, extended characters, and supplemental kanji characters are
              converted to a fullwidth white box, and the undefined region of halfwidth kana is
              converted to a halfwidth centered dot.

       -b bufsize
              Specify the buffer size. The default is 8 kbytes.

       -c     Don't convert; instead, check the input encoding and print the result to stdout.
              Unlike normal auto detection, the whole contents of the file are checked. However,
              when an inconsistency between encodings is found, kcc aborts reading and prints
              "data". Options other than -x and -z are ignored.

EXAMPLES

       % kcc -e file
              The input encoding is detected automatically, and the output is in EUC encoding.

       % kcc -sj file1 file2
              The two Shift JIS files are concatenated and converted to JIS.

       % command | kcc -k+J
              The output of command is converted to JIS (JIS X0208 kanji designated by
              ESC&@ESC$B, JIS Roman alphanumerics, and halfwidth kana designated by ESC(I).

       % kcc -c file
              The encoding of the contents of file is detected (no conversion).

BUG

       Auto detection of the input encoding works well in normal cases; however, it has the
       following problems.

       7 bit JIS is recognized with certainty by its escape sequences. EUC and DEC are treated as
       the same (referred to as the EUC series). The halfwidth kana of 8 bit JIS are the same as
       the halfwidth kana of Shift JIS (referred to as the Shift JIS series). However, the EUC
       series and the Shift JIS series, which are both 8 bit encodings, share wide regions of the
       code space. So the problem in auto detection is telling these 2 encodings apart.

       Detection of the EUC series versus the Shift JIS series is done line by line. As soon as
       the input is found not to be Shift JIS series, or not to be EUC series, the encoding is
       determined. When an inconsistency is found, the input is treated as "data" and the
       contents of the output are not guaranteed.

       While kcc is deciding between the EUC series and the Shift JIS series after an 8 bit code
       has been found, conversion is suspended and the input data is held in the buffer. If the
       buffer fills up, kcc assumes the EUC series and forces conversion to start. Rationale:
       documents containing kanji usually include JIS non-kanji or JIS level 1 kanji, and in
       Shift JIS these fall in a region not shared with EUC, so Shift JIS can be detected with
       certainty. So if the encoding can't be determined, it is very likely to be EUC.
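
       The line-by-line elimination and the buffer-full fallback described above can be sketched
       in Python. This is a rough illustration, not kcc's implementation: it substitutes Python's
       strict euc_jp and shift_jis codecs for kcc's own byte-range checks, ignores 7 bit JIS
       escape sequences, and the function name detect is hypothetical. What kcc does when the
       input ends before a decision is reached is not stated here, so the sketch returns None in
       that case.

```python
def decodes(data, codec):
    """Return True if data is valid in the given Python codec."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


def detect(lines, bufsize=8192):
    """Hypothetical sketch: classify byte lines as 'EUC', 'Shift JIS' or 'data'.

    Returns None if the input ends while the encoding is still undetermined.
    """
    pending = 0        # bytes held back while the decision is open
    seen_8bit = False  # buffering starts at the first 8 bit code
    for line in lines:
        if any(b >= 0x80 for b in line):
            seen_8bit = True
            euc = decodes(line, "euc_jp")
            sjis = decodes(line, "shift_jis")
            if euc and not sjis:
                return "EUC"        # found not to be Shift JIS series
            if sjis and not euc:
                return "Shift JIS"  # found not to be EUC series
            if not euc and not sjis:
                return "data"       # inconsistent input
        if seen_8bit:
            pending += len(line)
            if pending >= bufsize:
                return "EUC"        # buffer full: assume the EUC series
    return None
```

       With this sketch, a line of ordinary kanji text is normally decided at once, while a line
       that is valid in both codecs keeps the decision open until a later line settles it or the
       buffer limit is reached.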

       If the input is 8 bit JIS and its halfwidth kana always occur in sequences of even length,
       it will be wrongly detected as EUC kanji. Be careful.
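
       The reason is that each halfwidth kana is a single byte that is also a valid half of a two
       byte EUC kanji. The Python sketch below illustrates this with the standard euc_jp codec
       rather than kcc's own tables; the helper name valid_euc is hypothetical.

```python
def valid_euc(data):
    """Return True if data decodes cleanly as EUC (Python's euc_jp codec)."""
    try:
        data.decode("euc_jp")
        return True
    except UnicodeDecodeError:
        return False


# Halfwidth kana in 8 bit JIS / Shift JIS are single bytes in the range A1..DF.
kana_even = b"\xb1\xb2"      # two halfwidth kana
kana_odd = b"\xb1\xb2\xb3"   # three halfwidth kana

# An even-length run pairs up into EUC two byte characters, so it looks like kanji.
even_looks_like_euc = valid_euc(kana_even)

# An odd-length run leaves a dangling byte, so it cannot be EUC.
odd_looks_like_euc = valid_euc(kana_odd)
```

       Here even_looks_like_euc is True and odd_looks_like_euc is False, so only input whose kana
       runs are all of even length stays ambiguous.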

       If the input has no halfwidth kana, use -z and detection becomes much more accurate. This
       is because the shared region is then restricted to the area of JIS level 2 kanji.

       The extended region of Shift JIS, the user-defined area of EUC, the C1 control characters
       of EUC, and the undefined region of halfwidth kana of EUC are outside the range of auto
       detection, so detection fails if the input has these characters. Use the -x option to
       select extension mode, or specify the input encoding.

SEE ALSO

       cat(1)

NOTES

       Usually, user-defined characters, extended characters, and supplemental kanji characters
       are each mapped to the corresponding characters. However, characters outside the range of
       extended characters become FCFC in hexadecimal when converted to Shift JIS. Although the C1
       control character region of EUC and DEC is preserved when converted to JIS, these
       characters are deleted when converted to Shift JIS. The undefined area of halfwidth kana
       becomes a halfwidth centered dot when converted to Shift JIS. Halfwidth kana become
       fullwidth kana when converted to DEC.

       When the output is JIS encoding, control characters such as newline, TAB, and DEL, as well
       as (halfwidth) white space, are output in ASCII mode.
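
       The same convention can be observed in other JIS encoders. As an illustration only, the
       Python sketch below uses the standard iso2022_jp codec, not kcc; the function name
       encode_jis is hypothetical. It shows the encoder returning to ASCII with ESC(B before it
       writes a newline that follows kanji text.

```python
def encode_jis(text):
    """Encode text as 7 bit JIS using Python's iso2022_jp codec (not kcc)."""
    return text.encode("iso2022_jp")


# Two hiragana separated by a newline.
encoded = encode_jis("あ\nい")

# The newline is preceded by ESC(B, i.e. it is emitted in ASCII mode.
newline_in_ascii_mode = b"\x1b(B\n" in encoded
```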

       When the input encoding is detected wrongly, or the input contains characters that are
       undefined in the expected character sets, the output is undefined.

       This manual was translated by Fumitoshi UKAI <ukai@debian.or.jp> for the Debian system, but
       you can use it for any purpose.