Provided by: ucto_0.30-3build1_amd64 

NAME
ucto - Unicode Tokenizer
SYNOPSIS
ucto [options] [input-file] [output-file]
DESCRIPTION
ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally paragraphs), and finds paired quotes. Ucto is preconfigured with tokenization rules for several languages.
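As a rough illustration of the SYNOPSIS, the following Python sketch assembles a ucto command line and runs it only when a ucto executable is found on the PATH. The helper name build_ucto_command and the file names input.txt and output.txt are hypothetical; the -L and -n options and the 'fra' language code are the ones described under OPTIONS.

```python
import os
import shutil
import subprocess


def build_ucto_command(language, input_file, output_file, sentence_per_line=True):
    """Return an argv list following: ucto [options] [input-file] [output-file].

    build_ucto_command is a hypothetical helper, not part of ucto.
    """
    command = ["ucto", "-L", language]  # -L selects a configuration by language code
    if sentence_per_line:
        command.append("-n")  # -n emits one sentence per line on output
    command += [input_file, output_file]
    return command


# Hypothetical file names; 'fra' is the example language code from OPTIONS.
command = build_ucto_command("fra", "input.txt", "output.txt")
print(" ".join(command))

# Only run the tokenizer when ucto is installed and the input file exists.
if shutil.which("ucto") and os.path.exists("input.txt"):
    subprocess.run(command, check=True)
```

Building the argument list separately from running it makes it easy to inspect the exact command before any file is written.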
OPTIONS
-c configfile
    Read settings from a file.

-d value
    Set debug mode to 'value'.

-e value
    Set the input encoding (default: UTF8).

-N value
    Set the UTF8 output normalization (default: NFC).

--filter=[YES|NO]
    Enable or disable filtering of special characters (default: YES). These special characters can be specified in the [FILTER] block of the configuration file.

-f
    OBSOLETE. Use --filter=NO instead.

-L language
    Automatically select a configuration file by language code. The language code is generally a three-letter ISO 639-3 code. For example, 'fra' selects the file tokconfig-fra from the installation directory.

--detectlanguages=<lang1,lang2,..langn>
    Try to detect all the specified languages. The default language will be 'lang1' (only useful for FoLiA output). All language codes must be ISO 639-3.
    You can use the special language code `und`. This ensures there is NO default language, but text in any language that is NOT in the list will remain unanalyzed.
    Warning: To be able to handle utterances of mixed language, Ucto uses a simple sentence splitter based on the markers '.', '?' and '!'. This may occasionally lead to surprising results.

-l
    Convert to all lowercase.

-u
    Convert to all uppercase.

-n
    Emit one sentence per line on output.

-m
    Assume one sentence per line on input.

--normalize=class1,class2,..,classn
    Map all occurrences of tokens with class1,...,classn to their generic names. For example, --normalize=DATE will map all dates to the word {{DATE}}. Very useful for normalizing tokens like URLs, dates, e-mail addresses and so on.

-T value or --textredundancy=value
    Set the text redundancy level for text nodes in FoLiA output:
    'full'    - add text to all levels: <p> <s> <w> etc.
    'minimal' - don't introduce text on higher levels, but retain what is already there.
    'none'    - only introduce text on <w>, AND remove all text from higher levels.

--allow-word-correction
    Allow ucto to tokenize inside FoLiA Word elements, creating FoLiA Corrections.

--ignore-tag-hints
    Skip all tag=token hints from the FoLiA input. These hints can be used to signal text markup like subscript and superscript.

--add-tokens="file"
    Add additional tokens to the [TOKENS] block of the default language. The file should contain one TOKEN per line.

--passthru
    Don't tokenize, but perform input decoding and simple token role detection.

--filterpunct
    Remove most of the punctuation from the output (but not from abbreviations or embedded punctuation, as in John's).

-P
    Disable paragraph detection.

-Q
    Enable quote detection (this is experimental and may lead to unexpected results).

-s <string>
    Set the end-of-sentence marker (default: <utt>).

-V or --version
    Show version information.

-v
    Set verbose mode.

-F
    Read a FoLiA XML document, tokenize it, and output the modified document (this disables usage of most other options: -nPQvs). For files with an '.xml' extension, -F is the default.

--inputclass="cls"
    When tokenizing a FoLiA XML document, search for text nodes of class 'cls'. The default is "current".

--outputclass="cls"
    When tokenizing a FoLiA XML document, output the tokenized text in text nodes with class 'cls'. The default is "current". It is recommended to use different classes for input and output.

--textclass="cls" (obsolete)
    Use 'cls' for both input and output of text from FoLiA. Equivalent to specifying both --inputclass='cls' and --outputclass='cls'. This option is obsolete and NOT recommended. Please use the separate --inputclass= and --outputclass= options.

--copyclass
    When ucto is used on FoLiA input whose text in the input class is already fully tokenized, no text is produced in the output class (a warning will be given). Add the --copyclass option to circumvent this: it ensures that text is emitted in that class.

-X
    Output FoLiA XML (this disables usage of most other options: -nPQvs).

--id <DocId>
    Use the specified Document ID for the FoLiA XML.

-x <DocId> (obsolete)
    Output FoLiA XML, using the specified Document ID (this disables usage of most other options: -nPQvs). Use -X and --id instead.
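The FoLiA options above can be combined as in the following Python sketch, which only assembles and prints a command line and does not run ucto. The helper name build_folia_command, the class name "original" and the file names are hypothetical; -L, -F, --inputclass and --outputclass are the options documented in this section.

```python
def build_folia_command(language, input_file, output_file,
                        input_class, output_class="current"):
    """Return an argv list for tokenizing a FoLiA XML document with ucto.

    build_folia_command is a hypothetical helper, not part of ucto.
    """
    if input_class == output_class:
        # OPTIONS recommends different classes for input and output.
        print("note: input and output class are identical")
    return [
        "ucto",
        "-L", language,                    # configuration selected by language code
        "-F",                              # read FoLiA XML, write the modified document
        f"--inputclass={input_class}",     # class of the text nodes to read
        f"--outputclass={output_class}",   # class of the tokenized text nodes
        input_file,
        output_file,
    ]


# Hypothetical file and class names, chosen only for illustration.
command = build_folia_command("fra", "doc.folia.xml", "doc.tok.folia.xml",
                              input_class="original")
print(" ".join(command))
```

Keeping the input class distinct from the output class follows the recommendation given for --outputclass; whether a given document actually carries text in a class named "original" depends on how it was produced.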
BUGS
likely
AUTHORS
Maarten van Gompel proycon@anaproy.nl
Ko van der Sloot Timbl@uvt.nl

2023 apr 21 ucto(1)