Provided by: kytea_0.4.6+dfsg-2_amd64

NAME

       train-kytea — train models for KyTea, a word segmentation/pronunciation estimation tool

SYNOPSIS

       train-kytea [options]

DESCRIPTION

       This manual page briefly documents the train-kytea command.

       This manual page was written for the Debian distribution because the original program does
       not have a manual page.  Instead, it has documentation in the GNU Info format.

       KyTea is a morphological analysis system based on pointwise predictors.  It separates
       sentences into words, tags them, and predicts their pronunciations.  KyTea is pronounced
       the same as "cutie".  train-kytea trains the models that kytea uses for this analysis.
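
       For example, a minimal training run might look like the following (the file names are
       illustrative):

              train-kytea -full corpus.txt -model mymodel.bin

       This reads the fully annotated corpus corpus.txt and writes the trained model to the
       file mymodel.bin.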

OPTIONS

       A summary of options is included below.

   Input/Output Options:
       -encode    The text encoding to be used (utf8/euc/sjis; default: utf8)

       -full      A fully annotated training corpus (multiple possible)

       -tok       A training corpus that is tokenized with no tags (multiple possible)

       -part      A partially annotated training corpus (multiple possible)

       -conf      A confidence annotated training corpus (multiple possible)

       -feat      A file containing features generated by -featout

       -dict      A dictionary file (one 'word/pron' entry per line, multiple possible)

       -subword   A file of subword units. This enables pronunciation estimation (PE) for unknown
                   words.

       -model     The file to write the trained model to

       -modtext   Print a text model (instead of the default binary)

       -featout   Write the features used in training the model to this file
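
       For example, a model can be trained from several input sources at once (the file names
       are illustrative):

              train-kytea -encode utf8 -full full.txt -part part.txt -dict dict.txt \
                          -model model.txt -modtext

       Here -modtext writes the model as text instead of the default binary format.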

   Model Training Options (basic)
       -nows      Don't train a word segmentation model

       -notags    Skip the training of tagging, do only word segmentation

       -global    Train the nth tag with a global model (good for POS, bad for PE)

       -debug     The debugging level during training (0=silent, 1=normal, 2=detailed)
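
       For example, to train a word-segmentation-only model with detailed debugging output
       (file names illustrative):

              train-kytea -full full.txt -model ws-only.bin -notags -debug 2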

   Model Training Options (for advanced users):
       -charw     The character window to use for WS (3)

       -charn     The character n-gram length to use for WS (3)

       -typew     The character type window to use for WS (3)

       -typen     The character type n-gram length to use for WS (3)

       -dictn     Dictionary words longer than -dictn will be grouped together (4)

       -unkn      Language model n-gram order for unknown words (3)

       -eps       The epsilon stopping criterion for classifier training

       -cost      The cost hyperparameter for classifier training

       -nobias    Don't use a bias value in classifier training

       -solver    The solver (1=SVM,  7=logistic  regression,  etc.;  default  1,  see  LIBLINEAR
                  documentation for more details)
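
       For example, an illustrative run that increases the character window and n-gram length
       and switches the classifier to logistic regression (the values shown are arbitrary, not
       recommendations):

              train-kytea -full full.txt -model model.bin -charw 4 -charn 4 -cost 0.5 -solver 7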

   Format Options (for advanced users):
       -wordbound The separator for words in full annotation (" ")

       -tagbound  The separator for tags in full/partial annotation ("/")

       -elembound The separator for candidates in full/partial annotation ("&")

       -unkbound  Indicates unannotated boundaries in partial annotation (" ")

       -skipbound Indicates skipped boundaries in partial annotation ("?")

       -nobound   Indicates non-existence of boundaries in partial annotation ("-")

       -hasbound  Indicates existence of boundaries in partial annotation ("|")
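
       For illustration, with the default separators a fully annotated line consists of words
       separated by spaces, with tags attached to each word by "/" and alternative candidates
       joined by "&"; a hypothetical Japanese line might look like:

              これ/代名詞/これ は/助詞/は 学生/名詞/がくせい です/助動詞/です

       In partial annotation, "|" marks a word boundary, "-" marks the absence of a boundary,
       and " " or "?" leave a boundary unannotated or skipped, for example:

              こ-れ|は|学-生|で-す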

AUTHOR

       This manual page was written by Koichi Akabe <vbkaisetsu@gmail.com> for the Debian system
       (and may be used by others).  Permission is granted to copy, distribute and/or modify this
       document under the terms of the GNU General Public License, Version 2 or any later version
       published by the Free Software Foundation.

       On Debian systems, the complete text of the GNU General Public License  can  be  found  in
       /usr/share/common-licenses/GPL.

                                                                                   TRAIN-KYTEA(1)