Ubuntu Manpage: ucto - Unicode Tokenizer

NAME

       ucto - Unicode Tokenizer

SYNOPSIS

       ucto [options] [input‐file] [output‐file]

DESCRIPTION

       ucto  tokenizes  text  files:  it  separates  words  from  punctuation,  splits sentences (and optionally
       paragraphs), and finds paired  quotes.   Ucto  is  preconfigured  with  tokenisation  rules  for  several
       languages.

       Those rules are provided by uctodata.

OPTIONS

-c configfile
read settings from a 'configfile'

-B
run in batch mode: process all input files and write the results to the output directory specified
with -O.

-d value
set debug mode to 'value'

-e value
set input encoding. (default UTF8)

-I value
set the input directory to 'value'. (batch mode only)

-O value
set the output directory to 'value'. (Required for batch mode)
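
As a hedged sketch of batch mode, the following creates a small input directory and tokenizes every
file in it into an output directory. The directory names, the sample text, and the choice of
-L fra are illustrative assumptions; the ucto call is skipped when ucto is not installed.

```shell
# Hypothetical working directories; adjust to your own layout.
workdir=$(mktemp -d)
indir="$workdir/in"
outdir="$workdir/out"
mkdir -p "$indir" "$outdir"

# Two small sample files (illustrative content only).
printf 'Bonjour le monde. Comment allez-vous ?\n' > "$indir/a.txt"
printf 'Il fait beau. Nous partons demain.\n' > "$indir/b.txt"

# -B enables batch mode; -I and -O name the input and output directories.
if command -v ucto >/dev/null 2>&1; then
    ucto -B -L fra -I "$indir" -O "$outdir" || echo "ucto reported an error" >&2
else
    echo "ucto not found; skipping the batch run" >&2
fi
```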

-N value
set UTF8 output normalization. (default NFC)

--filter=[YES|NO]
enable or disable filtering of special characters (default YES). These special characters can be
specified in the [FILTER] block of the configuration file.

-L language
Automatically select a configuration file by language code. The language code is generally a
three-letter ISO 639-3 code. For example, 'fra' will select the file tokconfig‐fra from the
installation directory.
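
A minimal sketch of selecting a language with -L, assuming the French configuration from uctodata
is installed. The file names and sample sentence are hypothetical, and the ucto call is skipped
when ucto is not available.

```shell
# Hypothetical sample input in a temporary directory.
workdir=$(mktemp -d)
infile="$workdir/sample.txt"
outfile="$workdir/sample.tok"
printf 'Bonjour le monde. Comment allez-vous ?\n' > "$infile"

# 'fra' selects the tokconfig-fra file from the installation directory.
if command -v ucto >/dev/null 2>&1; then
    ucto -L fra "$infile" "$outfile" || echo "ucto reported an error" >&2
else
    echo "ucto not found; skipping" >&2
fi
```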

--detectlanguages=<lang1,lang2,..langn>
try to detect all the specified languages. The default language will be 'lang1'. (only useful for
FoLiA output).

All values must be ISO 639-3 codes.

You can also use the special language code `und`. This ensures there is NO default language, and
any language that is NOT in the list will remain unanalyzed.

Warning: To be able to handle utterances of mixed language, Ucto uses a simple sentence splitter
based on the markers '.' '?' and '!'. This may occasionally lead to surprising results.
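
The following sketch combines --detectlanguages with FoLiA output (-X), since language detection is
described above as only useful there. The language list fra,eng, the file names, and the sample
text are assumptions; both configurations would need to be installed for this to work.

```shell
# Hypothetical mixed-language sample input.
workdir=$(mktemp -d)
infile="$workdir/mixed.txt"
outfile="$workdir/mixed.folia.xml"
printf 'Bonjour le monde. This sentence is in English.\n' > "$infile"

# The first code in the list (fra) becomes the default language.
langs="fra,eng"

if command -v ucto >/dev/null 2>&1; then
    ucto --detectlanguages="$langs" -X "$infile" "$outfile" \
        || echo "ucto reported an error" >&2
else
    echo "ucto not found; skipping" >&2
fi
```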

-l
Convert output text to all lowercase

-u
Convert output text to all uppercase

-n
Emit one sentence per line on output

-m
Assume one sentence per line on input
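
A short sketch of the line-oriented options: -m treats each input line as one sentence and -n
writes one sentence per output line. The file names, the sample sentences, and -L fra are
illustrative assumptions.

```shell
# Hypothetical input with exactly one sentence per line.
workdir=$(mktemp -d)
infile="$workdir/lines.txt"
outfile="$workdir/lines.tok"
printf 'Il fait beau.\nNous partons demain.\n' > "$infile"

# -m: one sentence per input line; -n: one sentence per output line.
if command -v ucto >/dev/null 2>&1; then
    ucto -L fra -m -n "$infile" "$outfile" || echo "ucto reported an error" >&2
else
    echo "ucto not found; skipping" >&2
fi
```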

--normalize=class1,class2,..,classn
map all occurrences of tokens with class1,...,classn to their generic names, e.g. --normalize=DATE
will map all dates to the word {{DATE}}. Very useful to normalize tokens like URLs, dates,
e-mail addresses and so on.
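
As a hedged illustration of --normalize, the sketch below asks ucto to replace date tokens with
their generic name. Whether the sample date is actually classified as DATE depends on the rules of
the selected language configuration; the file names, sample text, and -L fra are assumptions.

```shell
# Hypothetical input containing a date-like token.
workdir=$(mktemp -d)
infile="$workdir/dates.txt"
outfile="$workdir/dates.tok"
printf 'Rendez-vous le 12/03/2024 à Paris.\n' > "$infile"

# Token classes to normalize (comma-separated); DATE is the example used above.
classes="DATE"

if command -v ucto >/dev/null 2>&1; then
    ucto -L fra --normalize="$classes" "$infile" "$outfile" \
        || echo "ucto reported an error" >&2
else
    echo "ucto not found; skipping" >&2
fi
```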

-T value or --textredundancy=value
set text redundancy level for text nodes in FoLiA output:
    'full'    - add text to all levels: <p> <s> <w> etc.
    'minimal' - don't introduce text on higher levels, but retain what is already there.
    'none'    - only introduce text on <w>, AND remove all text from higher levels

--allow-word-correction
Allow ucto to tokenize inside FoLiA Word elements, creating FoLiA Corrections

--ignore-tag-hints
Skip all tag=token hints from the FoLiA input. These hints can be used to signal text markup like
subscript and superscript

--add-tokens="file"
Add additional tokens to the [TOKENS] block of the default language. The file should contain one
TOKEN per line.

--passthru
Don't tokenize, but perform input decoding and simple token role detection

--filterpunct
remove most of the punctuation from the output. (not from abbreviations and embedded punctuation
like John's)

-P
Disable Paragraph Detection

-Q
Enable Quote Detection. (this is experimental and may lead to unexpected results)

-s <string>
Set End‐of‐sentence marker. (Default <utt>)

-V or --version
Show version information

-v
set Verbose mode

-F
The input file(s) are assumed to be FoLiA XML. Text in the correct 'inputclass' will be tokenized.
For files with an '.xml' extension, -F is the default.

In batch mode, this option restricts processing to files with the '.xml' extension in the input
directory.

--inputclass="cls"
When tokenizing a FoLiA XML document, search for text nodes of class 'cls'. The default is
"current".

--outputclass="cls"
When tokenizing a FoLiA XML document, output the tokenized text in text nodes with 'cls'. The
default is "current". It is recommended to have different classes for input and output.
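
A minimal sketch of tokenizing a FoLiA document with separate input and output classes, as
recommended above. The document name and the class name 'original' are hypothetical, as is
-L fra; the command only runs when both ucto and the input document are present.

```shell
# Hypothetical FoLiA document and class names; adjust to your data.
doc="document.xml"
out="document.tok.xml"
inclass="original"
outclass="current"

# -F declares the input to be FoLiA XML (the default for '.xml' files).
if command -v ucto >/dev/null 2>&1 && [ -f "$doc" ]; then
    ucto -L fra -F --inputclass="$inclass" --outputclass="$outclass" "$doc" "$out" \
        || echo "ucto reported an error" >&2
else
    echo "ucto or $doc not available; skipping" >&2
fi
```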

--textclass="cls" (obsolete)
use 'cls' for input and output of text from FoLiA. Equivalent to both --inputclass='cls' and
--outputclass='cls'.

This option is obsolete and NOT recommended. Please use the separate --inputclass= and
--outputclass= options.

--copyclass
when ucto is used on FoLiA with fully tokenized text in inputclass='inputclass', no text in
textclass 'outputclass' is produced (a warning will be given). To circumvent this, add the
--copyclass option, which ensures that text will be emitted in that class.

-X
All output will be FoLiA XML. Document IDs are autogenerated.

Works in batch mode too.

--id <DocId>
Use the specified Document ID for the FoLiA XML. (not allowed in batch mode) When not provided, a
document ID is generated based on the name of the input file.

BUGS

       likely

AUTHORS

       Maarten van Gompel

       Ko van der Sloot

       e-mail: lamasoftware@science.ru.nl

                                                   2024 apr 11                                           ucto(1)