lunar (1) fst-compiler-utf8.1.gz

Provided by: sfst_1.4.7b-1build2_amd64 bug

NAME

       fst-compiler, fst-compiler-utf8 - Two compilers for SFST programs

SYNOPSIS

       fst-compiler grammar-file [ output-file ]
       fst-compiler-utf8 grammar-file [ output-file ]

OPTIONS

       -c     Store the transducer in compact format which is used by fst-infl2.

       -l     Store the transducer in lowmem format.

       -s     Switch surface and analysis layer of the transducer. You have to use this switch in
              order to use fst-infl (fst-infl2, fst-infl3) for generation rather than analysis.

DESCRIPTION

       fst-compiler is a compiler for finite-state transducer programs. It generates a  minimized
       finite  state transducer which can be used with fst-mor, fst-infl, fst-print, fst-compare,
       fst-parse, and fst-lattice.  The compact transducer representation which is generated with
       the  -c  flag,  is supported by fst-infl2, fst-train, and fst-match.  The memory-efficient
       transducer representation which is generated with the -l flag, is only supported  by  fst-
       infl3.

       The  first  program  argument is the name of a file which contains the transducer program.
       The programming language is described below. The second argument is the name of  the  file
       to  which the resulting transducer will be written in binary form. If a second argument is
       missing, the output will be written to stdout.

       fst-compiler-utf8 differs from fst-compiler only in the character encoding.  fst-compiler-
       utf8  supports  UTF8  encoding  of the source files whereas fst-compiler is to be used for
       8-Bit character codes like latin1 which are an extension of the  ASCII  code.  Information
       about the encoding is stored in the transducer files and used by the other SFST programs.

FILE FORMATS

       A  transducer  program  consists  of  an  (optional)  sequence  of  alphabet  and variable
       definitions  followed  by  a  single  transducer  expression  which  defines  the   result
       transducer.

       Alphabet

       An  alphabet definition consists of the keyword ALPHABET followed by = and some transducer
       expression e.g.

           ALPHABET = [a-z]:[A-Z]

       This command redefines  the  alphabet  as  the  set  of  symbol  pairs  occurring  on  the
       transitions  of the transducer. Occurrences of two-level operators, negation operators and
       unquoted periods always have to be preceded by an alphabet definition.

       Variables

       There are two different types of variables.  Symbol set variables  are  enclosed  by  hash
       signs (#) and take symbol sequences (see below) as values:

           #UC# = A-Z
           #LC# = a-z

       Transducer  variables  are  enclosed  by  dollar  signs and take transducer expressions as
       values:

           $MAP$ = [a-z]:[A-Z]+
           $MAP$ = [#LC#]:[#UC#]+

       Variables whose name starts with the symbol `=' are special  agreement  variables.  If  an
       agreement  variable  occurs more than once in a transducer expression, it will always have
       the same value. Consider the following transducer program:

           $=1$ = [abc]
           $=1$ X $=1$

       The result transducer recognizes the strings aXa, bXb, and cXc. Only  acyclic  transducers
       (i.e.  transducers  with  a  finite  set  of string mappings) can be assigned to agreement
       variables.

       Symbols

       A symbol is either

       - a single character like A s 5,

       - a quoted character like \* or \_,

       - a multi-character symbol like <X> or <ab.c5> (which is always
         enclosed in angle brackets) or

       - a backslash followed by a number which is the numeric code of the
         designated character

       - the null symbol <>.

       Symbol sequence

       A symbol sequence is a sequence  of  characters,  multi-character  symbols  and  character
       ranges, e.g. a-z \. <x>.

       symbol range

       A symbol range is either

       - a single symbol

       - a symbol sequence enclosed in square brackets like [A-Za-z] or

       -  a  symbol  sequence  starting  with  ^  and  enclosed in square brackets like [^A-Za-z]
       (designating the complement of [a-zA-Z]) or

       - the period (which represents any symbol from the alphabet)

       Transducer expressions

       A transducer expression (TE) is recursively defined as follows:

       - A pair of two symbol ranges separated by a colon is a TE.

         [a-z]:[a-Z]

       - A single symbol range like [a-z] is a TE.
         It is a short form for [a-z]:[a-z].

       - Two symbol sequences enclosed in braces and separated by a colon are
        a TE. {a[bc]}:{def} is equivalent to a:d b:e <>:f | a:d c:e <>:f.

       - X Y is a TE if X and Y are TEs.
         (Blanks are ignored unless they are quoted.)

       - (X) is a TE if X is a TE.

       - X op is a TE is X is a TE and op is either * (Kleene's star operator), +
        (Kleene's plus operator), or ? (optionality operator)

       - op X is a TE is X is a TE and op is either ! (negation operator), ^
        (target language extraction operator), _ (source language  extraction  operator),  or  ^_
        (source and target switch operator).

       - X op Y is a TE is X and Y are TEs and op is either & (conjunction
        operator),  |  (disjunction  operator),  ||  (composition  operator),  or  - (subtraction
        operator)

       - L x op y R is a TE if L and R are TEs, x and y are symbol ranges and
        op is either => (two-level restriction),  <=  (two-level  coercion),  or  <=>  (two-level
        restriction and coercion).

       - X op L__R is a TE if X, L and R are TEs and op is either ^-> (upward
        replacement),  _->  (downward  replacement), /-> (leftward replacement) or \-> (rightward
        replacement). Furthermore, L and R must define automata (i.e.  which  map  their  strings
        onto  themselves).  These  operators  correspond to Karttunen's replace operators. If the
        arrow is followed by a question mark (?), the replacement becomes optional.

       - X << l is a TE if X is a TE, and l is either of the form
        a or the form a:b where a and b are  single  characters  or  symbols.  The  result  is  a
        transducer  where  l  was  freely inserted into X. The transducer ab << c for instance is
        equivalent to c*ac*bc*.

       - X op Y L1__R2, ... , LN__RN is a TE if X,Y, L1 through LN and R1
        through RN are TEs, and op is either => (general restriction), <= (general coercion), ^=>
        (general  surface  restriction),  ^<=  (general  surface coercion), ^<=> (general surface
        restriction and coercion), _=> (general deep restriction), _<= (general  deep  coercion),
        _<=> (general deep restriction and coercion). (These operators were implemented following
        a suggestion by Anssi Yli-Jyra.)

       - "fname" is a TE. The compiler reads the file named fname and turns
        it into a transducer of the form line1|line2|line3|... where linex is the  x-th  line  of
        the  file.  All  characters  other  than  :  and \ are interpreted literally (i.e. not as
        operators). This TE is typically used e.g. to read morpheme list from a file.

       - "<fname>" is a TE. The compiler reads a pre-compiled transducer from
        the file named fname. This

       Further Features

       Comments start with the symbol % and extend up to the end of the line.  Blanks are ignored
       unless  they are quoted. Expressions terminate at the end of a line unless the end of line
       is preceded by a backslash.  The command

       #include "fname"

       can be used to insert source code from a file named fname.  The command

       RE >> "fname"

       stores the regular expression RE in the file fname.  The command

       #use hopcroft

       tells the compiler to use the Hopcroft minimisation algorithm from now on, and

       #use default

       switches back to the default minimisation algorithm (Brzozowski).  The command

EXAMPLE

       Here is an example of a simple transducer program.  Assuming  that  the  file  "adj-stems"
       contains the two lines

          easy
          late
          big

       this transducer will correctly analyze the adjective forms easy, easier, easiest and late,
       later, and latest.

       ALPHABET = [a-zA-Z] y:i e:<> <ADJ>:<>

       $R$ = y<=>i (<ADJ>:<> e)

       $R2$ = e<=><> (<ADJ>:<> e)

       $R$ = $R$ & $R2$

       $Stems$ = "adj-stems"

       $S$ = $Stems$ <ADJ> (<pos>:<>|<cmp>:{er}|<sup>:{est})

       $S$ || $R$

EXIT STATUS

       fst-compiler returns 0 unless some error occurs.

BUGS

       The compiler gets the operator precedence wrong in case of two-level rules and  interprets
       the  expression  "ab c<=>d ef" as "a(b c<=>d (ef))". Therefore, you should always surround
       the left context of two-level rules with parenthesis: (ab) c<=>d (ef)

SEE ALSO

       fst-mor, fst-infl, fst-infl2, fst-infl3, fst-print, fst-compact,  fst-parse,  fst-compare,
       fst-compact, fst-lowmem, fst-lattice, fst-train

AUTHOR

       Helmut  Schmid,  Institute  for Computational Linguistics, University of Stuttgart, Email:
       schmid@ims.uni-stuttgart.de, This software is available under the GNU Public License.

                                          December 2004                           fst-compiler(1)