Ubuntu Manpage: ucto - Unicode Tokenizer

NAME

       ucto - Unicode Tokenizer

SYNOPSIS

       ucto [options] [input-file] [output-file]

DESCRIPTION

       ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally
       paragraphs), and finds paired quotes. Ucto is preconfigured with tokenization rules for several
       languages.
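
       As an illustration of the calling convention, the following Python sketch assembles a ucto
       command line and runs it only if the program is found on the PATH. The helper names
       build_ucto_argv and run_ucto and the file name input.txt are hypothetical, chosen for this
       example only. It also assumes that ucto writes to standard output when no output file is given,
       which this page does not state.

```python
import shutil
import subprocess


def build_ucto_argv(input_file, output_file=None, language=None,
                    sentence_per_line=False):
    """Assemble an argument list: ucto [options] [input-file] [output-file].

    Hypothetical helper; only the -L and -n options are covered.
    """
    argv = ["ucto"]
    if language is not None:
        argv += ["-L", language]   # -L: select a configuration by language code
    if sentence_per_line:
        argv.append("-n")          # -n: emit one sentence per line on output
    argv.append(input_file)
    if output_file is not None:
        argv.append(output_file)
    return argv


def run_ucto(argv):
    """Run the command if ucto is installed; return stdout, or None if it is not.

    Assumption: without an output file, ucto prints its result to stdout.
    """
    if shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # input.txt is a placeholder file name.
    print(build_ucto_argv("input.txt", language="fra", sentence_per_line=True))
```

       Keeping argument assembly separate from execution makes the command line easy to inspect or log
       before anything is run.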

OPTIONS

       -c configfile
              read settings from a file

       -d value
              set debug mode to 'value'

       -e value
              Set input encoding (default: UTF8)

       -N value
              Set UTF8 output normalization (default: NFC)

       -f
              disable filtering of special characters

       -L language
              Automatically select a configuration file by language code. The language code is generally a
              three-letter ISO 639-3 code. For example, 'fra' selects the file tokconfig-fra from the
              installation directory.

       -l
              Convert to all lowercase

       -u
              Convert to all uppercase

       -n
              Emit one sentence per line on output

       -m
              Assume one sentence per line on input

       --passthru
              Don't tokenize, but perform input decoding and simple token role detection

       --filterpunct
              Remove most of the punctuation from the output (but not from abbreviations)

       -P
              Disable Paragraph Detection

       -Q
              Enable Quote Detection. (this is experimental and may lead to unexpected results)

       -S
              Disable Sentence Detection

       -s <string>
              Set End‐of‐sentence marker. (Default <utt>)

       -V
              Show version information

       -v
              set Verbose mode

       -F
              Read  a FoLiA XML document, tokenize it, and output the modified doc. (this disables usage of most
              other options: -nulPQvsS)

       --textclass <cls>
              When tokenizing a FoLiA XML document, search for text nodes of class 'cls'

       -X
              Output FoLiA XML. (this disables usage of most other options: -nulPQvsS)

       --id <DocId>
              Use the specified Document ID for the FoLiA XML

       -x <DocId> (obsolete)
              Output FoLiA XML, using the specified Document ID. (this disables usage of most other options:
              -nulPQvsS)

              Obsolete: use -X and --id instead.
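
EXAMPLES

       Two of the options above lend themselves to a short Python sketch: the configuration file name
       selected by -L, and the end-of-sentence marker set by -s. The function names config_name and
       split_sentences are hypothetical, and the sample string is hand-written rather than real ucto
       output. The sketch assumes that tokens are separated by whitespace and that the marker appears
       as a token of its own.

```python
def config_name(language_code):
    """File name selected by -L, following the 'fra' -> tokconfig-fra example."""
    return "tokconfig-" + language_code


def split_sentences(tokenized, marker="<utt>"):
    """Split a tokenized string into sentences (lists of tokens) at each marker.

    Assumes whitespace-separated tokens with the marker as a separate token.
    """
    sentences, current = [], []
    for token in tokenized.split():
        if token == marker:
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(token)
    if current:  # trailing tokens that were not closed by a marker
        sentences.append(current)
    return sentences


# Hand-written sample for illustration, not actual ucto output.
sample = "Hello , world ! <utt> How are you ? <utt>"

print(config_name("fra"))
print(split_sentences(sample))
```

       If a different marker is set with -s, the same string would be passed as the marker argument.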

BUGS

       likely

AUTHORS

       Maarten van Gompel proycon@anaproy.nl

       Ko van der Sloot Timbl@uvt.nl

                                                 2014 december 2                                         ucto(1)