Provided by: ucto_0.30-3build1_amd64

NAME

       ucto - Unicode Tokenizer

SYNOPSIS

       ucto [options] [input‐file] [output‐file]

DESCRIPTION

       ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally
       paragraphs), and finds paired quotes. Ucto is preconfigured with tokenization rules for several
       languages.

OPTIONS

       -c configfile
              read settings from a file

       -d value
              set debug mode to 'value'

       -e value
              set input encoding. (default UTF8)

       -N value
              set UTF8 output normalization. (default NFC)

       --filter=[YES|NO]
              enable or disable filtering of special characters (default YES). These special characters can
              be specified in the [FILTER] block of the configuration file.

       -f
              OBSOLETE. Use --filter=NO instead.

       -L language
              Automatically selects a configuration file by language code. The language code is generally a
              three-letter ISO 639-3 code. For example, 'fra' will select the file tokconfig‐fra from the
              installation directory.

       --detectlanguages=<lang1,lang2,..langn>
              Try to detect all the specified languages. The default language will be 'lang1'. (Only useful
              for FoLiA output.)

              All language codes must be ISO 639-3 codes.

              You can use the special language code `und`. In that case there is NO default language, and
              text in any language that is NOT in the list will remain unanalyzed.

              Warning: To be able to handle utterances of mixed language, Ucto uses a simple  sentence  splitter
              based on the markers '.' '?' and '!'.  This may occasionally lead to surprising results.
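
              The following Python sketch is not ucto's implementation. It only illustrates, under the
              assumption that a splitter breaks after every '.', '?' or '!' that is followed by whitespace,
              why such a marker-based approach can give surprising results. The function name naive_split
              is hypothetical.

```python
import re


def naive_split(text):
    """Split after every '.', '?' or '!' that is followed by whitespace.

    This is an illustration of a marker-based splitter, not ucto's code.
    """
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [part for part in parts if part]


# The abbreviation "Dr." is treated as a sentence of its own,
# which is the kind of surprise the warning above refers to.
for sentence in naive_split("Dr. Smith arrived. Was he late? No!"):
    print(sentence)
```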

       -l
              Convert to all lowercase

       -u
              Convert to all uppercase

       -n
              Emit one sentence per line on output

       -m
              Assume one sentence per line on input

       --normalize=class1,class2,..,classn
              Map all occurrences of tokens with class1,...,classn to their generic names. For example,
              --normalize=DATE will map all dates to the word {{DATE}}. Very useful to normalize tokens like
              URLs, dates, e-mail addresses and so on.
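
              As a rough illustration of the mapping described here, the Python sketch below replaces
              every token of a selected class by its class name in double braces. It assumes tokens are
              available as (text, class) pairs. The function name and every class name other than DATE
              are hypothetical, and this is not how ucto itself is implemented.

```python
def normalize_tokens(tokens, classes):
    """Replace each token whose class is listed in `classes` by '{{CLASS}}'.

    `tokens` is a list of (text, token_class) pairs; this pair layout is an
    assumption made for the example.
    """
    wanted = set(classes)
    return ["{{" + cls + "}}" if cls in wanted else text for text, cls in tokens]


# Class names other than DATE are made up for this example.
tagged = [
    ("Meeting", "WORD"),
    ("on", "WORD"),
    ("2023-04-21", "DATE"),
    (".", "PUNCTUATION"),
]
print(" ".join(normalize_tokens(tagged, ["DATE"])))
```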

       -T value or --textredundancy=value
              set text redundancy level for text nodes in FoLiA output:
               'full'    - add text to all levels: <p> <s> <w> etc.
               'minimal' - don't introduce text on higher levels, but retain what is already there.
               'none'    - only introduce text on <w>, AND remove all text from higher levels

       --allow-word-correction
              Allow ucto to tokenize inside FoLiA Word elements, creating FoLiA Corrections

       --ignore-tag-hints
              Skip  all tag=token hints from the FoLiA input. These hints can be used to signal text markup like
              subscript and superscript

       --add-tokens="file"
              Add additional tokens to the [TOKENS] block of the default language.  The file should contain  one
              TOKEN per line.

       --passthru
              Don't tokenize, but perform input decoding and simple token role detection

       --filterpunct
              Remove most of the punctuation from the output (but not from abbreviations or from tokens with
              embedded punctuation, like John's).
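
              One way to picture this behaviour is the Python sketch below. It assumes that only tokens
              consisting entirely of Unicode punctuation characters are dropped, so tokens that merely
              contain punctuation survive. That rule and the function names are assumptions for
              illustration; ucto's actual criteria may differ.

```python
import unicodedata


def is_pure_punctuation(token):
    """True if every character of the token is in a Unicode 'P' category."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def filter_punct(tokens):
    """Drop tokens made up only of punctuation; keep everything else."""
    return [token for token in tokens if not is_pure_punctuation(token)]


# "John's" and "etc." contain letters, so they are kept.
print(filter_punct(["John's", "dog", ",", "etc.", "barked", "!"]))
```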

       -P
              Disable Paragraph Detection

       -Q
              Enable Quote Detection. (this is experimental and may lead to unexpected results)

       -s <string>
              Set End‐of‐sentence marker. (Default <utt>)

       -V or --version
              Show version information

       -v
              set Verbose mode

       -F
              Read a FoLiA XML document, tokenize it, and output the modified doc. (this disables usage of  most
              other options: -nPQvs) For files with an '.xml' extension, -F is the default.

       --inputclass="cls"
              When  tokenizing  a  FoLiA  XML  document,  search  for text nodes of class 'cls'.  The default is
              "current".

       --outputclass="cls"
              When tokenizing a FoLiA XML document, output the tokenized text in text  nodes  with  'cls'.   The
              default is "current".  It is recommended to have different classes for input and output.

       --textclass="cls" (obsolete)
              Use 'cls' for input and output of text from FoLiA. Equivalent to both --inputclass='cls' and
              --outputclass='cls'.

              This option is obsolete and NOT recommended. Please use the separate --inputclass= and
              --outputclass= options.

       --copyclass
              When ucto is used on FoLiA with fully tokenized text in inputclass='inputclass', no text in
              textclass 'outputclass' is produced (a warning will be given). To circumvent this, add the
              --copyclass option, which ensures that text will be emitted in that class.

       -X
              Output FoLiA XML. (this disables usage of most other options: -nPQvs)

       --id <DocId>
              Use the specified Document ID for the FoLiA XML

       -x <DocId> (obsolete)
              Output  FoLiA  XML,  use  the  specified  Document ID. (this disables usage of most other options:
              -nPQvs).

              Obsolete: use -X and --id instead.
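
       The options above can be combined on one command line in the order given in the SYNOPSIS. The
       Python sketch below assembles such a command using only the -L, -n and -l options documented on
       this page. The helper names and the file names input.txt and output.txt are hypothetical, and the
       command is only executed if a ucto binary is found on the PATH.

```python
import shutil
import subprocess


def build_ucto_command(input_file, output_file, language=None,
                       one_sentence_per_line=False, lowercase=False):
    """Return an argument list of the form: ucto [options] input output."""
    cmd = ["ucto"]
    if language:
        cmd += ["-L", language]  # select the configuration by language code
    if one_sentence_per_line:
        cmd.append("-n")         # emit one sentence per line on output
    if lowercase:
        cmd.append("-l")         # convert to all lowercase
    cmd += [input_file, output_file]
    return cmd


def run_ucto(cmd):
    """Run the command if ucto is installed; otherwise return None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=True)


cmd = build_ucto_command("input.txt", "output.txt",
                         language="fra", one_sentence_per_line=True)
print(" ".join(cmd))
```

       Passing the arguments as a list avoids shell quoting problems with file names that contain
       spaces.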

BUGS

       likely

AUTHORS

       Maarten van Gompel proycon@anaproy.nl

       Ko van der Sloot Timbl@uvt.nl

                                                   2023 apr 21                                           ucto(1)