Provided by: ucto_0.14-2build2_amd64

NAME

       ucto - Unicode Tokenizer

SYNOPSIS

       ucto [options] [input-file] [output-file]

DESCRIPTION

       ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally
       paragraphs), and finds paired quotes. Ucto is preconfigured with tokenization rules for several
       languages.

OPTIONS

       -c configfile
              read settings from a file

       -d value
              set debug mode to 'value'

       -e value
              Set the input encoding (default: UTF8).

       -N value
              Set the UTF8 output normalization (default: NFC).

       --filter=[YES|NO]
              Enable or disable filtering of special characters (default YES, filtering enabled). Use
              --filter=NO to disable it. The special characters can be specified in the [FILTER] block of the
              configuration file.

       -f
              OBSOLETE. Use --filter=NO instead.

       -L language
              Automatically select a configuration file by language code. The language code is generally a
              three-letter ISO 639-3 code. For example, 'fra' will select the file tokconfig-fra from the
              installation directory.
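
              As a minimal sketch of language selection, the shell function below wraps a call with -L. The
              function name tokenize_lang and the file names in the usage comment are hypothetical; the sketch
              assumes ucto is installed and that a configuration exists for the given language code.

```shell
# Hypothetical helper: tokenize a file with the configuration for a language code.
# $1 = language code (e.g. fra), $2 = input file, $3 = output file
tokenize_lang() {
    ucto -L "$1" "$2" "$3"
}

# Usage (file names are placeholders):
#   tokenize_lang fra input.txt output.txt
```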

       --detectlanguages=<lang1,lang2,..langn>
              Try to detect each of the specified languages. The default language will be 'lang1'. Only useful
              for FoLiA output.

       -l
              Convert to all lowercase

       -u
              Convert to all uppercase

       -n
              Emit one sentence per line on output

       -m
              Assume one sentence per line on input
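
              The -m and -n options can presumably be combined when the input already holds one sentence per
              line and the same layout is wanted on output. A sketch, assuming ucto is installed; the function
              name and file names are hypothetical:

```shell
# Hypothetical helper: tokenize line-delimited sentences and keep one sentence per line.
# $1 = language code, $2 = input file (one sentence per line), $3 = output file
tokenize_sentence_lines() {
    # -m: assume one sentence per line on input
    # -n: emit one sentence per line on output
    ucto -L "$1" -m -n "$2" "$3"
}

# Usage (file names are placeholders):
#   tokenize_sentence_lines fra sentences.txt tokenized.txt
```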

       --normalize=class1,class2,..,classn
              Map all occurrences of tokens with class1,...,classn to their generic names. For example,
              --normalize=DATE will map all dates to the word {{DATE}}. Very useful for normalizing tokens such
              as URLs, dates, e-mail addresses and so on.
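
              A short sketch of class normalization, assuming ucto is installed. The function name and file
              names are hypothetical, and DATE is the only class name taken from the description above; other
              class names depend on the configuration in use.

```shell
# Hypothetical helper: replace tokens of the given classes by their generic names.
# $1 = language code, $2 = comma-separated class list (e.g. DATE),
# $3 = input file, $4 = output file
tokenize_normalized() {
    ucto -L "$1" --normalize="$2" "$3" "$4"
}

# Usage (file names are placeholders):
#   tokenize_normalized fra DATE input.txt output.txt
```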

       --add-tokens="file"
              Add  additional tokens to the [TOKENS] block of the default language.  The file should contain one
              TOKEN per line.

       --passthru
              Don't tokenize, but perform input decoding and simple token role detection

       --filterpunct
              Remove most of the punctuation from the output (but not from abbreviations or embedded
              punctuation, as in John's).
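
              As an illustration, punctuation filtering might be combined with lowercasing and one sentence per
              line to prepare plain text for further processing. This is a sketch under the assumption that
              these options can be used together; the function name and file names are hypothetical.

```shell
# Hypothetical helper: lowercase, drop most punctuation, one sentence per line.
# $1 = language code, $2 = input file, $3 = output file
tokenize_plain() {
    # -l: convert to lowercase; --filterpunct: remove most punctuation;
    # -n: one sentence per line on output
    ucto -L "$1" -l --filterpunct -n "$2" "$3"
}

# Usage (file names are placeholders):
#   tokenize_plain fra input.txt cleaned.txt
```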

       -P
              Disable Paragraph Detection

       -Q
              Enable Quote Detection. (this is experimental and may lead to unexpected results)

       -s <string>
              Set the end-of-sentence marker (default: <utt>).

       -V
              Show version information

       -v
              Set verbose mode

       -F
              Read  a FoLiA XML document, tokenize it, and output the modified doc. (this disables usage of most
              other options: -nPQvs) For files with an '.xml' extension, -F is the default.

       --inputclass="cls"
              When tokenizing a FoLiA XML document, search for text  nodes  of  class  'cls'.   The  default  is
              "current".

       --outputclass="cls"
              When  tokenizing  a  FoLiA  XML document, output the tokenized text in text nodes with 'cls'.  The
              default is "current".  It is recommended to have different classes for input and output.
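
              A sketch of tokenizing a FoLiA document with separate input and output classes, as recommended
              above. It assumes ucto is installed; the function name, the file names and the class names in the
              usage comment are hypothetical.

```shell
# Hypothetical helper: tokenize a FoLiA XML document using distinct text classes.
# $1 = language code, $2 = input text class, $3 = output text class,
# $4 = input FoLiA file, $5 = output FoLiA file
tokenize_folia() {
    ucto -L "$1" -F --inputclass="$2" --outputclass="$3" "$4" "$5"
}

# Usage (class and file names are placeholders):
#   tokenize_folia fra original tokenized input.xml output.xml
```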

       --textclass="cls"(obsolete)
              Use 'cls' for input and output of text from FoLiA. Equivalent to specifying both
              --inputclass='cls' and --outputclass='cls'.

              This option is obsolete and NOT recommended. Please use the separate --inputclass and
              --outputclass options.

       -X
              Output FoLiA XML. (this disables usage of most other options: -nPQvs)
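
              A minimal sketch of producing FoLiA XML from a plain text file, assuming ucto is installed. The
              function name and file names are hypothetical.

```shell
# Hypothetical helper: tokenize plain text and write the result as FoLiA XML.
# $1 = language code, $2 = plain text input file, $3 = FoLiA XML output file
text_to_folia() {
    ucto -L "$1" -X "$2" "$3"
}

# Usage (file names are placeholders):
#   text_to_folia fra input.txt output.folia.xml
```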

       --id <DocId>
              Use the specified Document ID for the FoLiA XML

       -x <DocId> (obsolete)
              Output FoLiA XML, use the specified Document ID. (this  disables  usage  of  most  other  options:
              -nPQvs).

              This option is obsolete. Use -X and --id instead.

BUGS

       likely

AUTHORS

       Maarten van Gompel proycon@anaproy.nl

       Ko van der Sloot Timbl@uvt.nl

                                                   2018 nov 13                                           ucto(1)