Provided by: ucto_0.14-2_amd64


       ucto - Unicode Tokenizer


       ucto [[options]] [input‐file] [[output‐file]]


       ucto tokenizes text files: it separates words from punctuation, splits sentences (and
       optionally paragraphs), and finds paired quotes.  Ucto is preconfigured with tokenization
       rules for several languages.
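
       A basic invocation can be sketched as follows. This is only an illustration in Python:
       the helper name build_ucto_command and the file names are hypothetical and not part of
       ucto. The sketch assembles the argument list in the order given in the synopsis
       (options, input file, output file) and runs ucto only if an executable of that name is
       found on the PATH.

```python
import os
import shutil
import subprocess


def build_ucto_command(language, input_file, output_file=None, encoding=None):
    """Assemble an argument list of the form: ucto [options] input-file [output-file].

    Hypothetical helper for illustration; it is not part of ucto itself.
    """
    cmd = ["ucto", "-L", language]   # -L selects a configuration file by language code
    if encoding is not None:
        cmd += ["-e", encoding]      # -e sets the input encoding (default UTF8)
    cmd.append(input_file)
    if output_file is not None:
        cmd.append(output_file)
    return cmd


if __name__ == "__main__":
    # 'input.txt' and 'output.txt' are placeholder file names.
    cmd = build_ucto_command("fra", "input.txt", "output.txt")
    print(" ".join(cmd))  # prints: ucto -L fra input.txt output.txt
    # Only call the tokenizer when it is installed and the input file exists.
    if shutil.which("ucto") and os.path.exists("input.txt"):
        subprocess.run(cmd, check=True)
```

       Building the command as a list rather than a single string avoids shell quoting
       problems when file names contain spaces.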


       -c configfile
              read settings from a file

       -d value
              set debug mode to 'value'

       -e value
              set input encoding. (default UTF8)

       -N value
              set UTF8 output normalization. (default NFC)

       --filter=[YES|NO]
              enable or disable filtering of special characters (default YES). These special
              characters can be specified in the [FILTER] block of the configuration file.

       -f
              OBSOLETE. Use --filter=NO instead.

       -L language
              Automatically selects a configuration file by language code.  The language code is
              generally a three-letter ISO 639-3 code.  For example, 'fra' will select the file
              tokconfig‐fra from the installation directory.

       --detectlanguages=<lang1,lang2,..,langn>
              try to detect all the specified languages. The default language will be 'lang1'.
              (only useful for FoLiA output)

       -l
              Convert to all lowercase

       -u
              Convert to all uppercase

       -n
              Emit one sentence per line on output

       -m
              Assume one sentence per line on input

       --normalize=class1,class2,..,classn
              map all occurrences of tokens with class1,..,classn to their generic names, e.g.
              --normalize=DATE will map all dates to the word {{DATE}}. Very useful to normalize
              tokens like URLs, dates, e-mail addresses and so on.

       --add-tokens=file
              Add additional tokens to the [TOKENS] block of the default language.  The file
              should contain one TOKEN per line.

       --passthru
              Don't tokenize, but perform input decoding and simple token role detection

       --filterpunct
              remove most of the punctuation from the output. (not from abbreviations and
              embedded punctuation like John's)

       -P
              Disable Paragraph Detection

       -Q
              Enable Quote Detection. (this is experimental and may lead to unexpected results)

       -s <string>
              Set End‐of‐sentence marker. (Default <utt>)

       -V or --version
              Show version information

       -v
              set Verbose mode

       -F
              Read a FoLiA XML document, tokenize it, and output the modified doc. (this disables
              usage of most other options: -nPQvs) For files with an '.xml' extension, -F is the
              default.

       --inputclass=<cls>
              When tokenizing a FoLiA XML document, search for text nodes of class 'cls'.  The
              default is "current".

       --outputclass=<cls>
              When tokenizing a FoLiA XML document, output the tokenized text in text nodes with
              class 'cls'.  The default is "current".  It is recommended to have different
              classes for input and output.

       --textclass=<cls>
              use 'cls' for input and output of text from FoLiA. Equivalent to both
              --inputclass='cls' and --outputclass='cls'.

              This option is obsolete and NOT recommended. Please use the separate --inputclass=
              and --outputclass= options.

       -X
              Output FoLiA XML. (this disables usage of most other options: -nPQvs)

       --id <DocId>
              Use the specified Document ID for the FoLiA XML

       -x <DocId> (obsolete)
              Output FoLiA XML, use the specified Document ID. (this disables usage of most other
              options: -nPQvs).

              Obsolete. Use -X and --id instead.
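
       The --normalize and --filter options described above take their values after an equals
       sign, with multiple token classes separated by commas. A minimal Python sketch of
       building such options is given below. The helper names are hypothetical, and the class
       name URL is an assumption: only DATE is named explicitly in the option description, so
       check the configuration file for the class names that actually exist.

```python
def normalize_option(classes):
    """Format token class names as one --normalize=class1,..,classn option.

    Hypothetical helper for illustration; it is not part of ucto itself.
    """
    names = [c.strip() for c in classes if c.strip()]
    if not names:
        raise ValueError("at least one token class is required")
    return "--normalize=" + ",".join(names)


def build_normalizing_command(language, classes, input_file, keep_special_chars=False):
    """Assemble a ucto argument list that maps the given token classes to generic names."""
    cmd = ["ucto", "-L", language, normalize_option(classes)]
    if keep_special_chars:
        cmd.append("--filter=NO")   # switch off filtering of special characters
    cmd.append(input_file)
    return cmd


if __name__ == "__main__":
    # 'URL' is an assumed class name; 'input.txt' is a placeholder file name.
    print(" ".join(build_normalizing_command("fra", ["DATE", "URL"], "input.txt")))
    # prints: ucto -L fra --normalize=DATE,URL input.txt
```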




       Maarten van Gompel

       Ko van der Sloot

                                           2018 nov 13                                    ucto(1)