NAME

       ucto - Unicode Tokenizer

SYNOPSIS

       ucto [options] [input‐file] [output‐file]

DESCRIPTION

       ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally
       paragraphs), and finds paired quotes. Ucto is preconfigured with tokenization rules for several
       languages.

OPTIONS

       -c configfile
              read settings from a file

       -d value
              set debug mode to 'value'

       -e value
              set input encoding. (default UTF8)

       -N value
              set UTF8 output normalization. (default NFC)

       --filter=[YES|NO]
              enable or disable filtering of special characters (default YES). These special characters can be
              specified in the [FILTER] block of the configuration file.

       -f
              OBSOLETE. use --filter=NO

       -L language
              Automatically selects a configuration file by language code. The language code is generally a
              three-letter ISO 639-3 code. For example, 'fra' will select the file tokconfig-fra from the
              installation directory.

       --detectlanguages=<lang1,lang2,..langn>
              try to detect all the specified languages. The default language will be 'lang1'.  (only useful for
              FoLiA output)

       -l
              Convert to all lowercase

       -u
              Convert to all uppercase

       -n
              Emit one sentence per line on output

       -m
              Assume one sentence per line on input

       --normalize=class1,class2,..,classn
              map all occurrences of tokens with class1,...,classn to their generic names, e.g. --normalize=DATE
              will map all dates to the word {{DATE}}. Very useful for normalizing tokens like URLs, dates,
              e-mail addresses and so on.

       --add-tokens="file"
              Add additional tokens to the [TOKENS] block of the default language.  The file should contain  one
              TOKEN per line.

       --passthru
              Don't tokenize, but perform input decoding and simple token role detection

       --filterpunct
              remove most of the punctuation from the output (but not from abbreviations or embedded
              punctuation, as in John's)

       -P
              Disable Paragraph Detection

       -Q
              Enable Quote Detection. (this is experimental and may lead to unexpected results)

       -s <string>
              Set End‐of‐sentence marker. (Default <utt>)

       -V
              Show version information

       -v
              set Verbose mode

       -F
              Read a FoLiA XML document, tokenize it, and output the modified doc. (this disables usage of  most
              other options: -nPQvs) For files with an '.xml' extension, -F is the default.

       --inputclass="cls"
              When  tokenizing  a  FoLiA  XML  document,  search  for text nodes of class 'cls'.  The default is
              "current".

       --outputclass="cls"
              When tokenizing a FoLiA XML document, output the tokenized text in text nodes with class 'cls'.
              The default is "current". It is recommended to have different classes for input and output.

       --textclass="cls" (obsolete)
              use 'cls' for input and output of text from FoLiA. Equivalent to both --inputclass='cls' and
              --outputclass='cls'.

              This option is obsolete and NOT recommended. Please use the separate --inputclass and
              --outputclass options.

       -X
              Output FoLiA XML. (this disables usage of most other options: -nPQvs)

       --id <DocId>
              Use the specified Document ID for the FoLiA XML

       -x <DocId> (obsolete)
              Output  FoLiA  XML,  use  the  specified  Document ID. (this disables usage of most other options:
              -nPQvs).

              Obsolete: use -X and --id instead.
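
EXAMPLES

       The options above can be combined on a single command line. The following is a minimal sketch, not
       taken from the ucto distribution: the function names and the file arguments are hypothetical, and
       'fra' is simply the language code used as an example under -L. Only options documented in OPTIONS
       are used, and it is assumed that the selected configuration defines a DATE token class.

```shell
#!/bin/sh
# Sketch of common ucto invocations, using only options listed in OPTIONS.
# Function names and file arguments are hypothetical placeholders.

# Tokenize a plain-text file and emit one sentence per line (-n).
tokenize_sentences() {
    ucto -L fra -n "$1" "$2"
}

# Lowercase the output (-l) and map every DATE token to {{DATE}}.
tokenize_normalized() {
    ucto -L fra -l --normalize=DATE "$1" "$2"
}

# Tokenize plain text and write the result as FoLiA XML (-X).
tokenize_to_folia() {
    ucto -L fra -X "$1" "$2"
}
```

       Each function takes an input file and an output file, in the order given in SYNOPSIS, for example:
       tokenize_sentences input.txt output.txt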

BUGS

       likely

AUTHORS

       Maarten van Gompel proycon@anaproy.nl

       Ko van der Sloot Timbl@uvt.nl

                                                   2018 nov 13                                           ucto(1)