Provided by: ucto_0.14-2build2_amd64 

NAME
ucto - Unicode Tokenizer
SYNOPSIS
ucto [options] [input‐file] [output‐file]
DESCRIPTION
ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally
paragraphs), and finds paired quotes. Ucto is preconfigured with tokenization rules for several
languages.
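As a minimal sketch of the invocation pattern shown in the SYNOPSIS, the following shell snippet writes a small sample file and passes it to ucto as the input file, with a second path as the output file. The file names, the sample text, and the helper show_and_run are assumptions made for illustration. The command is printed in every case and executed only when ucto is found on the PATH. Depending on the installation, a language (-L) or a configuration file (-c) may also need to be given.

```shell
#!/bin/sh
# Sketch only: file names, sample text, and the helper function are hypothetical.

# Print a command line, then run it only if ucto is installed.
show_and_run() {
    echo "+ $*"
    if command -v ucto >/dev/null 2>&1; then
        "$@"
    fi
}

# Work in a temporary directory so no existing files are touched.
workdir=$(mktemp -d)
printf '%s\n' 'Hello, world! This is a test.' > "$workdir/input.txt"

# ucto [options] [input-file] [output-file]
show_and_run ucto "$workdir/input.txt" "$workdir/output.txt"
```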
OPTIONS
-c configfile
read settings from a file
-d value
set debug mode to 'value'
-e value
set input encoding. (default UTF8)
-N value
set UTF8 output normalization. (default NFC)
--filter=[YES|NO]
enable (YES) or disable (NO) filtering of special characters. (default YES) These special characters
can be specified in the [FILTER] block of the configuration file.
-f
OBSOLETE. Use --filter=NO instead.
-L language
Automatically selects a configuration file by language code. The language code is generally a
three-letter ISO 639-3 code. For example, 'fra' will select the file tokconfig‐fra from the
installation directory.
--detectlanguages=<lang1,lang2,..langn>
try to detect all the specified languages. The default language will be 'lang1'. (only useful for
FoLiA output)
-l
Convert to all lowercase
-u
Convert to all uppercase
-n
Emit one sentence per line on output
-m
Assume one sentence per line on input
--normalize=class1,class2,..,classn
map all occurrences of tokens with class1,...,classn to their generic names, e.g. --normalize=DATE
will map all dates to the word {{DATE}}. Very useful to normalize tokens like URLs, dates,
e-mail addresses and so on.
--add-tokens="file"
Add additional tokens to the [TOKENS] block of the default language. The file should contain one
TOKEN per line.
--passthru
Don't tokenize, but perform input decoding and simple token role detection
--filterpunct
remove most of the punctuation from the output. (not from abbreviations and embedded punctuation
like John's)
-P
Disable Paragraph Detection
-Q
Enable Quote Detection. (this is experimental and may lead to unexpected results)
-s <string>
Set End‐of‐sentence marker. (Default <utt>)
-V
Show version information
-v
set Verbose mode
-F
Read a FoLiA XML document, tokenize it, and output the modified doc. (this disables usage of most
other options: -nPQvs) For files with an '.xml' extension, -F is the default.
--inputclass="cls"
When tokenizing a FoLiA XML document, search for text nodes of class 'cls'. The default is
"current".
--outputclass="cls"
When tokenizing a FoLiA XML document, output the tokenized text in text nodes with 'cls'. The
default is "current". It is recommended to have different classes for input and output.
--textclass="cls"(obsolete)
use 'cls' for input and output of text from FoLiA. Equivalent to both --inputclass='cls' and
--outputclass='cls'.
This option is obsolete and NOT recommended. Please use the separate --inputclass= and
--outputclass= options.
-X
Output FoLiA XML. (this disables usage of most other options: -nPQvs)
--id <DocId>
Use the specified Document ID for the FoLiA XML
-x <DocId> (obsolete)
Output FoLiA XML, use the specified Document ID. (this disables usage of most other options:
-nPQvs).
Obsolete. Use -X and --id instead.
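EXAMPLES
The options above can be combined on one command line. The following is a minimal sketch, assuming that the French configuration file tokconfig‐fra is present in the installation directory. The file names, the sample text, and the helper show_and_run are hypothetical. Each command is printed, and executed only when ucto is found on the PATH.

```shell
#!/bin/sh
# Sketch only: file names, sample text, and the helper function are hypothetical.

# Print a command line, then run it only if ucto is installed.
show_and_run() {
    echo "+ $*"
    if command -v ucto >/dev/null 2>&1; then
        "$@"
    fi
}

# Work in a temporary directory so no existing files are touched.
workdir=$(mktemp -d)
printf '%s\n' 'Bonjour, le monde ! Ceci est un test.' > "$workdir/input.txt"

# Select the French configuration (-L fra) and emit one sentence per line (-n).
show_and_run ucto -L fra -n "$workdir/input.txt" "$workdir/sentences.txt"

# As above, but also map every token of class DATE to {{DATE}}.
show_and_run ucto -L fra -n --normalize=DATE "$workdir/input.txt" "$workdir/normalized.txt"

# Write FoLiA XML instead of plain text; -X disables most other options, such as -n.
show_and_run ucto -L fra -X "$workdir/input.txt" "$workdir/output.xml"
```

Printing each command before running it makes it easy to check which options were passed when several invocations are scripted together.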
BUGS
likely
AUTHORS
Maarten van Gompel proycon@anaproy.nl
Ko van der Sloot Timbl@uvt.nl
2018 nov 13 ucto(1)