Provided by: sfst_1.4.7b-1build2_amd64 bug

NAME

       fst-compiler, fst-compiler-utf8 - Two compilers for SFST programs

SYNOPSIS

       fst-compiler grammar-file [ output-file ]
       fst-compiler-utf8 grammar-file [ output-file ]

OPTIONS

       -c     Store the transducer in compact format which is used by fst-infl2.

       -l     Store the transducer in lowmem format.

       -s     Switch  surface  and analysis layer of the transducer. You have to use this switch in order to use
              fst-infl (fst-infl2, fst-infl3) for generation rather than analysis.

DESCRIPTION

       fst-compiler is a compiler for finite-state transducer programs. It generates a  minimized  finite  state
       transducer  which can be used with fst-mor, fst-infl, fst-print, fst-compare, fst-parse, and fst-lattice.
       The compact transducer representation which is generated with the -c flag,  is  supported  by  fst-infl2,
       fst-train,  and fst-match.  The memory-efficient transducer representation which is generated with the -l
       flag, is only supported by fst-infl3.

       The first program argument is the name of a file which contains the transducer program.  The  programming
       language  is  described  below.  The  second  argument  is  the  name  of the file to which the resulting
       transducer will be written in binary form. If a second argument is missing, the output will be written to
       stdout.

       fst-compiler-utf8  differs  from fst-compiler only in the character encoding.  fst-compiler-utf8 supports
       UTF8 encoding of the source files whereas fst-compiler is to be  used  for  8-Bit  character  codes  like
       latin1  which  are  an  extension  of  the  ASCII  code.  Information about the encoding is stored in the
       transducer files and used by the other SFST programs.

FILE FORMATS

       A transducer program consists of an (optional) sequence of alphabet and variable definitions followed  by
       a single transducer expression which defines the result transducer.

       Alphabet

       An alphabet definition consists of the keyword ALPHABET followed by = and some transducer expression e.g.

           ALPHABET = [a-z]:[A-Z]

       This  command  redefines  the  alphabet  as  the  set of symbol pairs occurring on the transitions of the
       transducer. Occurrences of two-level operators, negation operators and unquoted periods always have to be
       preceded by an alphabet definition.

       Variables

       There are two different types of variables.  Symbol set variables are enclosed by hash signs (#) and take
       symbol sequences (see below) as values:

           #UC# = A-Z
           #LC# = a-z

       Transducer variables are enclosed by dollar signs and take transducer expressions as values:

           $MAP$ = [a-z]:[A-Z]+
           $MAP$ = [#LC#]:[#UC#]+

       Variables whose name starts with the symbol `=' are special agreement variables. If an agreement variable
       occurs  more  than  once  in  a  transducer  expression, it will always have the same value. Consider the
       following transducer program:

           $=1$ = [abc]
           $=1$ X $=1$

       The result transducer  recognizes  the  strings  aXa,  bXb,  and  cXc.  Only  acyclic  transducers  (i.e.
       transducers with a finite set of string mappings) can be assigned to agreement variables.

       Symbols

       A symbol is either

       - a single character like A s 5,

       - a quoted character like \* or \_,

       - a multi-character symbol like <X> or <ab.c5> (which is always
         enclosed in angle brackets) or

       - a backslash followed by a number which is the numeric code of the
         designated character

       - the null symbol <>.

       Symbol sequence

       A  symbol sequence is a sequence of characters, multi-character symbols and character ranges, e.g. a-z \.
       <x>.

       symbol range

       A symbol range is either

       - a single symbol

       - a symbol sequence enclosed in square brackets like [A-Za-z] or

       - a symbol sequence starting with ^ and enclosed in  square  brackets  like  [^A-Za-z]  (designating  the
       complement of [a-zA-Z]) or

       - the period (which represents any symbol from the alphabet)

       Transducer expressions

       A transducer expression (TE) is recursively defined as follows:

       - A pair of two symbol ranges separated by a colon is a TE.

         [a-z]:[a-Z]

       - A single symbol range like [a-z] is a TE.
         It is a short form for [a-z]:[a-z].

       - Two symbol sequences enclosed in braces and separated by a colon are
        a TE. {a[bc]}:{def} is equivalent to a:d b:e <>:f | a:d c:e <>:f.

       - X Y is a TE if X and Y are TEs.
         (Blanks are ignored unless they are quoted.)

       - (X) is a TE if X is a TE.

       - X op is a TE is X is a TE and op is either * (Kleene's star operator), +
        (Kleene's plus operator), or ? (optionality operator)

       - op X is a TE is X is a TE and op is either ! (negation operator), ^
        (target language extraction operator), _ (source language extraction operator), or ^_ (source and target
        switch operator).

       - X op Y is a TE is X and Y are TEs and op is either & (conjunction
        operator), | (disjunction operator), || (composition operator), or - (subtraction operator)

       - L x op y R is a TE if L and R are TEs, x and y are symbol ranges and
        op is either => (two-level restriction), <= (two-level coercion),  or  <=>  (two-level  restriction  and
        coercion).

       - X op L__R is a TE if X, L and R are TEs and op is either ^-> (upward
        replacement),  _->  (downward  replacement),  /-> (leftward replacement) or \-> (rightward replacement).
        Furthermore, L and R must define  automata  (i.e.  which  map  their  strings  onto  themselves).  These
        operators  correspond to Karttunen's replace operators. If the arrow is followed by a question mark (?),
        the replacement becomes optional.

       - X << l is a TE if X is a TE, and l is either of the form
        a or the form a:b where a and b are single characters or symbols. The result is a transducer where l was
        freely inserted into X. The transducer ab << c for instance is equivalent to c*ac*bc*.

       - X op Y L1__R2, ... , LN__RN is a TE if X,Y, L1 through LN and R1
        through  RN  are  TEs,  and  op  is either => (general restriction), <= (general coercion), ^=> (general
        surface restriction), ^<= (general surface coercion), ^<=> (general surface restriction  and  coercion),
        _=>  (general  deep  restriction),  _<=  (general  deep  coercion),  _<=>  (general deep restriction and
        coercion). (These operators were implemented following a suggestion by Anssi Yli-Jyra.)

       - "fname" is a TE. The compiler reads the file named fname and turns
        it into a transducer of the form line1|line2|line3|... where linex is the x-th line  of  the  file.  All
        characters  other  than  : and \ are interpreted literally (i.e. not as operators). This TE is typically
        used e.g. to read morpheme list from a file.

       - "<fname>" is a TE. The compiler reads a pre-compiled transducer from
        the file named fname. This

       Further Features

       Comments start with the symbol % and extend up to the end of the line.  Blanks are  ignored  unless  they
       are quoted. Expressions terminate at the end of a line unless the end of line is preceded by a backslash.
       The command

       #include "fname"

       can be used to insert source code from a file named fname.  The command

       RE >> "fname"

       stores the regular expression RE in the file fname.  The command

       #use hopcroft

       tells the compiler to use the Hopcroft minimisation algorithm from now on, and

       #use default

       switches back to the default minimisation algorithm (Brzozowski).  The command

EXAMPLE

       Here is an example of a simple transducer program. Assuming that the file "adj-stems"  contains  the  two
       lines

          easy
          late
          big

       this  transducer  will  correctly  analyze the adjective forms easy, easier, easiest and late, later, and
       latest.

       ALPHABET = [a-zA-Z] y:i e:<> <ADJ>:<>

       $R$ = y<=>i (<ADJ>:<> e)

       $R2$ = e<=><> (<ADJ>:<> e)

       $R$ = $R$ & $R2$

       $Stems$ = "adj-stems"

       $S$ = $Stems$ <ADJ> (<pos>:<>|<cmp>:{er}|<sup>:{est})

       $S$ || $R$

EXIT STATUS

       fst-compiler returns 0 unless some error occurs.

BUGS

       The compiler gets the operator precedence wrong in case of two-level rules and interprets the  expression
       "ab  c<=>d  ef" as "a(b c<=>d (ef))". Therefore, you should always surround the left context of two-level
       rules with parenthesis: (ab) c<=>d (ef)

SEE ALSO

       fst-mor, fst-infl, fst-infl2, fst-infl3, fst-print,  fst-compact,  fst-parse,  fst-compare,  fst-compact,
       fst-lowmem, fst-lattice, fst-train

AUTHOR

       Helmut  Schmid,  Institute for Computational Linguistics, University of Stuttgart, Email: schmid@ims.uni-
       stuttgart.de, This software is available under the GNU Public License.

                                                  December 2004                                  fst-compiler(1)