Ubuntu Manpage: frog - Dutch Natural Language Toolkit

NAME

       frog - Dutch Natural Language Toolkit

SYNOPSIS

       frog [options]

       frog -t test-file

DESCRIPTION

       Frog is an integration of memory-based natural language processing (NLP) modules
       developed for Dutch. Frog's current version will (optionally) tokenize, tag, lemmatize,
       and morphologically segment word tokens in Dutch text files, add IOB chunks, add Named
       Entities, and assign a dependency graph to each sentence.

OPTIONS

       -c <file>  or --config=<file>
              set the configuration using 'file'.

              You can use -c lang/config-file to select the 'config-file' for an installed
              language 'lang'.

       --debug=<module><level>,...
              set  debug  level  per  module, indicated by a single letter: Tagger (T), Tokenizer
              (t), Lemmatizer (l), Morphological Analyzer (a), Chunker (c), Multi‐Word Units (m),
              Named Entity Recognition (n), or Parser (p). Different modules must be separated by
              commas.

              (e.g. --debug=l5,n3 sets the level for the Lemmatizer to 5 and for the NER to 3)
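
              The per-module syntax above can be assembled programmatically. The following is
              a minimal Python sketch, not part of Frog: debug_option is a hypothetical helper
              that only builds the argument string from the module letters listed above, and
              it does not run Frog itself.

```python
# Module letters as listed for --debug in this manual page.
DEBUG_MODULES = {
    "T": "Tagger",
    "t": "Tokenizer",
    "l": "Lemmatizer",
    "a": "Morphological Analyzer",
    "c": "Chunker",
    "m": "Multi-Word Units",
    "n": "Named Entity Recognition",
    "p": "Parser",
}


def debug_option(levels):
    """Build a --debug=<module><level>,... argument from a {letter: level} dict.

    Hypothetical helper for illustration; it only formats the string.
    """
    parts = []
    for letter, level in levels.items():
        if letter not in DEBUG_MODULES:
            raise ValueError(f"unknown module letter: {letter!r}")
        if not isinstance(level, int) or level < 0:
            raise ValueError(f"debug level must be a non-negative integer: {level!r}")
        # Each module is written as its letter directly followed by the level.
        parts.append(f"{letter}{level}")
    # Different modules are separated by commas.
    return "--debug=" + ",".join(parts)


print(debug_option({"l": 5, "n": 3}))  # --debug=l5,n3
```

              Validating the letters up front is a design choice of the sketch; how Frog
              itself reacts to an unknown letter is not described here.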

       -d <level>
              set global debug level. (for all modules)

       --deep-morph
              generate a deep morphological analysis and add it to the XML.  This  also  includes
              compound  information.   The default 'Tabbed' and JSON output is also more detailed
              in the Morpheme field.

       -e <encoding>
              set input encoding. (default UTF8)

       -h or --help
              give some help

       --language=<comma separated list of languages>
              Set the languages to work on. This parameter  is  passed  to  the  tokenizer.   The
              strings are assumed to be ISO 639-2 codes.

              The first language in the list will be the default; unspecified languages are
              assumed to be of that default.

              e.g. --language=nld,eng,por means: detect Dutch, English and Portuguese, with Dutch
              being the default.

              IMPORTANT: Frog can at the moment handle only one language at a time, as determined
              by the config file. Text in the other languages mentioned here will be tokenized
              correctly, but all further processing treats it as the configured language.
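
              Because the first code in the list becomes the default, the order of the list
              matters. A minimal Python sketch, assuming three-letter lowercase codes as in the
              example above; language_option is a hypothetical helper and only formats the
              argument.

```python
def language_option(default, others=()):
    """Build --language=<list>, placing the default language first.

    Hypothetical helper for illustration; it does not call Frog.
    """
    codes = [default, *others]
    for code in codes:
        # Assumption: codes are three lowercase letters, like nld, eng, por.
        if not (len(code) == 3 and code.isalpha() and code.islower()):
            raise ValueError(f"not a three-letter language code: {code!r}")
    if len(set(codes)) != len(codes):
        raise ValueError("duplicate language code in list")
    return "--language=" + ",".join(codes)


print(language_option("nld", ["eng", "por"]))  # --language=nld,eng,por
```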

       -n
              assume inputfile to have one sentence per line. (newline separators)

              Very useful when running interactively; otherwise an empty line is needed to
              signal end of input.

       --nostdout
              suppress the 'Tabbed' or JSON output to stdout. (when no outputfile  was  specified
              with -o or --outputdir)

              Especially useful when XML output is specified with -X or --xmldir.

       -o <file>
              send  'Tabbed'  output  to  'file'  instead  of stdout. Defaults to the name of the
              inputfile with '.out' appended.

       --outputdir <dir>
              send all 'Tabbed' or JSON output to 'dir' instead of stdout. Creates filenames from
              the inputfilename(s) with '.out' appended.

       --retry
              assume a re-run on the same input file(s). Frog will only process those files that
              haven't been processed yet. This is accomplished by looking at the output file
              names. (so this has no effect if none of -o, --outputdir, -X or --xmldir is used)

       --skip=[tlacnmp]
              skip  parts  of  the process: Tokenizer (t), Lemmatizer (l), Morphological Analyzer
              (a), Chunker (c), Named Entity Recognition (n), Multi-Word Units (m) or Parser (p).

              Skipping the Multi-Word Units module implies disabling the Parser too.
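
              The dependency between the Multi-Word Units module and the Parser can be made
              explicit when composing a command line. A minimal Python sketch, assuming the
              letters are concatenated without separators (as the [tlacnmp] notation
              suggests); skip_option and effectively_skipped are hypothetical helpers.

```python
# Letters accepted by --skip according to this manual page.
SKIPPABLE = "tlacnmp"


def skip_option(letters):
    """Build a --skip argument, rejecting letters that are not skippable."""
    unknown = sorted(set(letters) - set(SKIPPABLE))
    if unknown:
        raise ValueError(f"cannot skip module(s): {unknown}")
    # Emit each requested letter once, in the order used by the manual.
    ordered = "".join(ch for ch in SKIPPABLE if ch in letters)
    return "--skip=" + ordered


def effectively_skipped(letters):
    """Return the set of modules that end up disabled."""
    skipped = set(letters)
    if "m" in skipped:
        # Skipping Multi-Word Units (m) implies disabling the Parser (p) too.
        skipped.add("p")
    return skipped


print(skip_option("pc"))                  # --skip=cp
print(sorted(effectively_skipped("m")))   # ['m', 'p']
```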

       --alpino
              Use a locally installed Alpino parser

       --alpino=server
              Use a remotely installed Alpino server, as specified in the frog configuration file.

       -S <port>
              Run Frog as a server on 'port'

       -t <file>
              process 'file'.

              -t can be omitted. Frog will run on any <file> found on the command-line.
              Wildcards are allowed too. When NO files are specified, Frog will start in
              interactive mode.

       -x <xmlfile>
              process 'xmlfile', which is supposed to be in FoLiA format! If 'xmlfile' is empty,
              and --testdir=<dir> is provided, all '.xml' files in 'dir' will be processed as
              FoLiA XML.

       -X <xmlfile>
              When 'xmlfile' is specified, create a FoLiA XML output file with that name.

              When 'xmlfile' is empty, generate XML output for every inputfile.

       --textclass=<cls>
              When -x is given, use 'cls' to find AND store text in the FoLiA document(s).  Using
              --inputclass and --outputclass is in general a better choice.

       --inputclass=<cls>
              use 'cls' to find text in the FoLiA input document(s).

       --outputclass=<cls>
              use 'cls' to output text in the FoLiA input document(s). Preferably this is
              another class than the inputclass.

       --testdir=<dir>
              process all files in 'dir'. When the input mode is XML, only '.xml' files are taken
              from 'dir'. See also --outputdir.
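
              A batch run over a directory typically combines several of the options on this
              page. The Python sketch below only builds and prints such a command line; the
              directory names are placeholders, batch_xml_command is a hypothetical helper, and
              the resulting list could be handed to subprocess.run on a system where Frog is
              installed.

```python
import shlex


def batch_xml_command(testdir, xmldir, threads=None):
    """Build a frog command that writes FoLiA XML for every file in testdir.

    Hypothetical helper; it uses only options documented on this page.
    """
    argv = [
        "frog",
        f"--testdir={testdir}",  # process all files in this directory
        f"--xmldir={xmldir}",    # write FoLiA XML output files here
        "--nostdout",            # suppress the 'Tabbed' output on stdout
    ]
    if threads is not None:
        if threads < 1:
            raise ValueError("threads must be at least 1")
        argv.append(f"--threads={threads}")
    return argv


argv = batch_xml_command("corpus", "folia", threads=4)
print(shlex.join(argv))
# frog --testdir=corpus --xmldir=folia --nostdout --threads=4
```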

       --tmpdir=<dir>
              location to store intermediate files. Default /tmp. NOT USED!

       --uttmarker=<mark>
              assume all utterances are separated by 'mark'. (the default is none).

       --threads=<n>
              use a maximum of 'n' threads. The default is to take whatever is needed. In
              server mode we always run on 1 thread per session.

       -V or --version
              show version info

       --xmldir=<dir>
              generate FoLiA XML output  and  send  it  to  'dir'.  Creates  filenames  from  the
              inputfilename with '.xml' appended. (Except when it already ends with '.xml')
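
              The naming rule above can be sketched in Python, for example to predict which
              output files a run will produce. This is an illustration only: xml_output_path is
              a hypothetical helper, and it assumes that only the file name of the input (not
              its directory) is carried over into 'dir', which this page does not state.

```python
from pathlib import Path


def xml_output_path(input_file, xmldir):
    """Predict the FoLiA XML output path for one input file."""
    # Assumption: only the file name is kept, not the input's directory.
    name = Path(input_file).name
    # '.xml' is appended, except when the name already ends with '.xml'.
    if not name.endswith(".xml"):
        name += ".xml"
    return Path(xmldir) / name


print(xml_output_path("corpus/page1.txt", "folia").as_posix())  # folia/page1.txt.xml
print(xml_output_path("corpus/page2.xml", "folia").as_posix())  # folia/page2.xml
```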

       -X <file>
              generate  FoLiA  XML  output  and  send  it  to 'file'. Defaults to the name of the
              inputfile(s) with '.xml' appended. (Except when it already ends with '.xml')

       --id=<id>
              When -X for FoLiA is given, use 'id' to give the doc an ID. The default is an
              xml:id based on the filename.

       --allow-word-corrections
              Allow  the  ucto  tokenizer  to  apply simple corrections on words while processing
              FoLiA output, for instance by splitting off punctuation.

       --max-parser-tokens=<num>
              Limit the size of sentences to be handled by the Parser. (Default 500 words).

              The Parser is very memory-intensive: 500 words will already need 16 GB of RAM.

       --JSONin
              The input is in JSON format. Mainly for Server mode, but works on files too.

              This implies --JSONout too!

       --JSONout
              Output will be in JSON instead of 'Tabbed'.

       --JSONout=<indent>
              Output will be in JSON instead of 'Tabbed'. The JSON will be indented by value
              'indent'. (Default is indent=0, meaning all the JSON will be on 1 line)

       -T or --textredundancy=[full|minimal|none]
              Set the text redundancy level in the tokenizer for text nodes in FoLiA output:

              full   add text to all levels: <p> <s> <w> etc.

              minimal
                     don't introduce text on higher levels, but retain what is already there.

              none   only introduce text on <w>, AND remove all text from higher levels.

       --override=<section>.<parameter>=<value>
              Override a parameter from the configuration file with a different value.

              This option may be repeated several times.
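
              Since the option may be repeated, a set of overrides maps naturally onto a list
              of arguments. A minimal Python sketch: override_options is a hypothetical helper,
              and the section and parameter names in the example are placeholders, not actual
              Frog configuration keys.

```python
def override_options(overrides):
    """Turn {(section, parameter): value} into repeated --override arguments.

    Hypothetical helper for illustration; it only formats the arguments.
    """
    args = []
    for (section, parameter), value in overrides.items():
        if not section or not parameter:
            raise ValueError("section and parameter must be non-empty")
        if "." in section:
            # The first '.' separates section from parameter, so keep it unambiguous.
            raise ValueError(f"section name may not contain '.': {section!r}")
        args.append(f"--override={section}.{parameter}={value}")
    return args


# 'mysection' and 'myparam' are placeholder names, not real configuration keys.
print(override_options({("mysection", "myparam"): "42"}))
# ['--override=mysection.myparam=42']
```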

BUGS

       likely

AUTHORS

       Maarten van Gompel

       Ko van der Sloot

       Antal van den Bosch

       e-mail: lamasoftware@science.ru.nl
