NAME

       ucto - Unicode Tokenizer

SYNOPSIS

       ucto [options] [input‐file] [output‐file]

DESCRIPTION

       ucto tokenizes text files: it separates words from punctuation, splits sentences (and
       optionally paragraphs), and finds paired quotes.  Ucto is preconfigured with tokenization
       rules for several languages.
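
       As an illustration of the SYNOPSIS, the following Python sketch assembles a ucto command
       line from the -L and -n options described under OPTIONS. The helper name
       build_ucto_command and the file names are hypothetical; the sketch only prints the
       command and does not run ucto.

```python
import shlex


def build_ucto_command(language, input_file, output_file=None, one_sentence_per_line=False):
    """Assemble an argv list following: ucto [options] [input-file] [output-file]."""
    argv = ["ucto", "-L", language]  # -L selects the configuration by language code
    if one_sentence_per_line:
        argv.append("-n")  # -n: emit one sentence per line on output
    argv.append(input_file)
    if output_file is not None:
        argv.append(output_file)
    return argv


if __name__ == "__main__":
    # "input.txt" and "output.txt" are placeholder file names.
    command = build_ucto_command("fra", "input.txt", "output.txt", one_sentence_per_line=True)
    print(shlex.join(command))
```

       The returned list can be passed to subprocess.run() on a system where ucto is installed.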

OPTIONS

       -c configfile
              read settings from a file

       -d value
              set debug mode to 'value'

       -e value
              set input encoding. (default UTF8)

       -N value
              set UTF8 output normalization. (default NFC)

       --filter=[YES|NO]
              enable or disable filtering of special characters (default YES).  These special
              characters can be specified in the [FILTER] block of the configuration file.

       -f
              OBSOLETE. Use --filter=NO instead.

       -L language
              Automatically select a configuration file by language code.  The language code is
              generally a three-letter ISO 639-3 code.  For example, 'fra' will select the file
              tokconfig-fra from the installation directory.

       --detectlanguages=<lang1,lang2,..langn>
              try to detect all the specified languages. The default language  will  be  'lang1'.
              (only useful for FoLiA output)

       -l
              Convert to all lowercase

       -u
              Convert to all uppercase

       -n
              Emit one sentence per line on output

       -m
              Assume one sentence per line on input

       --normalize=class1,class2,..,classn
              map all occurrences of tokens with class1,...,classn to their generic names, e.g.
              --normalize=DATE will map all dates to the word {{DATE}}. Very useful to normalize
              tokens like URLs, dates, e-mail addresses and so on.

       --add-tokens="file"
              Add  additional  tokens  to  the  [TOKENS] block of the default language.  The file
              should contain one TOKEN per line.

       --passthru
              Don't tokenize, but perform input decoding and simple token role detection

       --filterpunct
              remove most of the punctuation from the output. (not from abbreviations and
              embedded punctuation like John's)

       -P
              Disable Paragraph Detection

       -Q
              Enable Quote Detection. (this is experimental and may lead to unexpected results)

       -s <string>
              Set End‐of‐sentence marker. (Default <utt>)

       -V
              Show version information

       -v
              set Verbose mode

       -F
              Read a FoLiA XML document, tokenize it, and output the modified doc. (this disables
              usage of most other options: -nPQvs) For files with an '.xml' extension, -F is  the
              default.

       --inputclass="cls"
              When  tokenizing  a  FoLiA XML document, search for text nodes of class 'cls'.  The
              default is "current".

       --outputclass="cls"
              When tokenizing a FoLiA XML document, output the tokenized text in text nodes  with
              'cls'.   The default is "current".  It is recommended to have different classes for
              input and output.

       --textclass="cls" (obsolete)
              use 'cls' for input and output of text from FoLiA.  Equivalent to both
              --inputclass='cls' and --outputclass='cls'.

              This option is obsolete and NOT recommended. Please use the separate --inputclass
              and --outputclass options.

       -X
              Output FoLiA XML. (this disables usage of most other options: -nPQvs)

       --id <DocId>
              Use the specified Document ID for the FoLiA XML

       -x <DocId> (obsolete)
              Output FoLiA XML, use the specified Document ID. (this disables usage of most other
              options: -nPQvs).

              Obsolete. Use -X and --id instead.
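
       The FoLiA options above can be combined as in the following Python sketch, which
       assembles a command for tokenizing an existing FoLiA document with separate input and
       output text classes, as recommended under --outputclass. The helper name
       build_folia_command, the class name "tokenized", and the file names are hypothetical;
       the sketch only prints the command and does not run ucto.

```python
import shlex


def build_folia_command(input_file, output_file, inputclass="current", outputclass="current"):
    """Assemble an argv list for tokenizing a FoLiA XML document with ucto."""
    if inputclass == outputclass:
        # The option text recommends different classes for input and output.
        print("note: input and output class are identical:", inputclass)
    return [
        "ucto",
        "-F",  # read FoLiA XML (already the default for '.xml' files)
        "--inputclass=" + inputclass,
        "--outputclass=" + outputclass,
        input_file,
        output_file,
    ]


if __name__ == "__main__":
    # "doc.xml", "doc.tok.xml" and the class "tokenized" are placeholders.
    print(shlex.join(build_folia_command("doc.xml", "doc.tok.xml", outputclass="tokenized")))
```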

BUGS

       likely

AUTHORS

       Maarten van Gompel proycon@anaproy.nl

       Ko van der Sloot Timbl@uvt.nl

                                           2018 nov 13                                    ucto(1)