Provided by: ucto_0.9.6-1build2_amd64
NAME
ucto - Unicode Tokenizer
SYNOPSIS
ucto [options] [input‐file] [output‐file]
DESCRIPTION
ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally paragraphs), and finds paired quotes. Ucto is preconfigured with tokenization rules for several languages.
OPTIONS
-c configfile
       Read settings from a file.

-d value
       Set debug mode to 'value'.

-e value
       Set input encoding. (default UTF8)

-N value
       Set UTF8 output normalization. (default NFC)

-f
       Disable filtering of special characters.

-L language
       Automatically select a configuration file by language code. The language code is generally a three-letter ISO 639-3 code. For example, 'fra' will select the file tokconfig-fra from the installation directory.

-l
       Convert to all lowercase.

-u
       Convert to all uppercase.

-n
       Emit one sentence per line on output.

-m
       Assume one sentence per line on input.

--passthru
       Don't tokenize, but perform input decoding and simple token role detection.

--filterpunct
       Remove most of the punctuation from the output. (not from abbreviations!)

-P
       Disable paragraph detection.

-Q
       Enable quote detection. (this is experimental and may lead to unexpected results)

-S
       Disable sentence detection.

-s <string>
       Set end-of-sentence marker. (default <utt>)

-V
       Show version information.

-v
       Set verbose mode.

-F
       Read a FoLiA XML document, tokenize it, and output the modified doc. (this disables usage of most other options: -nulPQvsS)

--textclass cls
       When tokenizing a FoLiA XML document, search for text nodes of class 'cls'.

-X
       Output FoLiA XML. (this disables usage of most other options: -nulPQvsS)

--id <DocId>
       Use the specified Document ID for the FoLiA XML.

-x <DocId>
       (obsolete) Output FoLiA XML, using the specified Document ID. (this disables usage of most other options: -nulPQvsS) Use -X and --id instead.
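As an illustration of how the options combine with the input and output file arguments, the following Python sketch assembles a ucto command line. It is not part of ucto: the helper name build_ucto_command and the file names input.txt and output.txt are hypothetical, and only flags listed above (-L, -n, -l) are used. Options are placed before the file arguments, following the SYNOPSIS.

```python
import shlex


def build_ucto_command(input_file, output_file=None, language=None,
                       sentence_per_line=False, lowercase=False):
    """Assemble an argv list for ucto (hypothetical helper, not part of ucto).

    Only flags documented in the OPTIONS list are emitted.
    """
    argv = ["ucto"]
    if language:
        # -L selects a configuration file by language code, e.g. 'fra'
        argv += ["-L", language]
    if sentence_per_line:
        # -n emits one sentence per line on output
        argv.append("-n")
    if lowercase:
        # -l converts the output to all lowercase
        argv.append("-l")
    argv.append(input_file)
    if output_file:
        argv.append(output_file)
    return argv


# Hypothetical file names, used only for illustration.
cmd = build_ucto_command("input.txt", "output.txt",
                         language="fra", sentence_per_line=True)
print(shlex.join(cmd))  # prints: ucto -L fra -n input.txt output.txt
```

If ucto is installed, the resulting list could be passed to subprocess.run(cmd, check=True); the sketch itself only builds and prints the command.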
BUGS
likely
AUTHORS
Maarten van Gompel proycon@anaproy.nl
Ko van der Sloot Timbl@uvt.nl

2014 december 2 ucto(1)