
NAME

       mmorph - MULTEXT morphology tool formalism syntax

DESCRIPTION

       An mmorph morphology description file is divided into declaration sections.  Each section
       starts with a section header (`@ Alphabets', `@ Attributes', etc.) followed by a sequence
       of declarations.  Each declaration starts with a name, followed by a colon (`:') and the
       definition associated with the name.  Here is a brief description of each section:

@ Alphabets

       In this section the lexical and surface alphabets are declared.  All symbols forming each
       alphabet have to be listed.  A symbol may appear in both the lexical and the surface
       alphabet definition, in which case it is considered a bi-level symbol; otherwise it is a
       lexical only or surface only symbol.  Symbols are usually letters (e.g. a, b, c), but may
       also have longer names (beta, schwa).  Symbol names consisting of one special character
       (`:' or `(') may be specified by enclosing them in double quotes (`":"' or `"("').
       Example:

              Lexical  :  a b c d e f g h i j k l m n o p q r s t u v w x y z "-" "." "," "?" "!"
                     "\"" "'" ":" ";" "(" ")" strong_e

              Surface : a b c d e f g h i j k l m n o p q r s t u v w x y z "-" "." ","  "?"  "!"
                     "\"" "'" ":" ";" "(" ")" " "

       In  this  example,  the symbol strong_e is lexical only, the symbol " " (space) is surface
       only.  All the other symbols are bi-level.

       All the strings appearing in the rest of the grammar will be made exclusively  of  symbols
       declared in this section.
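
       The three kinds of symbols can be pictured with plain set operations.  The Python sketch
       below is only an illustration and not part of mmorph; the function name classify_symbols
       is hypothetical, and the two alphabets are shortened versions of the example above.

```python
def classify_symbols(lexical, surface):
    """Split two alphabets into bi-level, lexical only and surface only symbols."""
    lexical, surface = set(lexical), set(surface)
    return {
        "bi-level": lexical & surface,       # declared in both alphabets
        "lexical only": lexical - surface,   # declared in Lexical only
        "surface only": surface - lexical,   # declared in Surface only
    }


lexical = ["a", "b", "c", "-", "strong_e"]
surface = ["a", "b", "c", "-", " "]
kinds = classify_symbols(lexical, surface)
print(kinds["lexical only"])   # {'strong_e'}
print(kinds["surface only"])   # {' '}
```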

@ Attributes

       In this section, the names of attributes (sometimes called features) and their associated
       value sets are declared.  At most 32 different values may be declared for an attribute.
       Examples:

              Gender : feminine masculine neuter
              Number : singular plural
              Person : 1st 2nd 3rd
              Transitive : yes no
              Inflection : base intermediate final

       In the current version of the  implementation  value  sets  of  different  attributes  are
       incompatible,  even  if  they are defined identically.  To overcome this restriction, in a
       future version this section will be  split  into  two:   declaration  of  value  sets  and
       declaration of attributes.

@ Types

       In  this  section, the different types of feature structures are declared.  The attributes
       allowed for each type are listed.  Attributes that are only used within the scope  of  the
       tool and have no meaning outside it can be listed after a bar (`|').  The values of these
       local attributes are not stored in the database or written to the final output of the
       program.
       Examples:

              Noun : Gender Number
              Verb : Tense Person Gender Number Transitive | Inflection

Typed feature structures

       Typed feature structures are used in the grammar and spelling rules.  A typed feature
       structure is the specification of a type and of the values of some of its attributes.  The
       list of attribute specifications is enclosed in square brackets (`[' and `]').
       Example:

              Noun[ Gender=feminine Number=singular ]

       It is possible to specify a set of values for an attribute by listing the possible values
       separated by bars (`|'), or the complement of a set (with respect to all possible
       values of that attribute), indicated with `!=' instead of `='.
       Example:  Assuming the declaration of Gender as above, the following two typed feature
       structures are equivalent:

              Noun[ Gender=masculine|neuter ]
              Noun[ Gender!=feminine ]
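
       The equivalence can be checked by computing both value sets.  The following Python sketch
       is an illustration only; the helper value_set and its arguments are hypothetical and do
       not correspond to anything in the mmorph implementation.

```python
GENDER = ("feminine", "masculine", "neuter")  # values declared for Gender above


def value_set(declared, op, values):
    """Return the set denoted by `= a|b' or, for `!=', its complement."""
    listed = set(values.split("|"))
    unknown = listed - set(declared)
    if unknown:
        raise ValueError(f"undeclared values: {sorted(unknown)}")
    if op == "=":
        return listed
    if op == "!=":
        # Complement with respect to all declared values of the attribute.
        return set(declared) - listed
    raise ValueError(f"unknown operator: {op}")


a = value_set(GENDER, "=", "masculine|neuter")
b = value_set(GENDER, "!=", "feminine")
print(a == b)  # True
```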

@ Grammar

       This section contains the rules that specify the structure of words.  It has  the  general
       shape of a context free grammar over typed feature structures.  There are three basic
       types of rules:  binary, goal and affix.

       Binary rules specify the result of the concatenation of two elements. This is written as:

              Rule_name : Lhs <- Rhs1 Rhs2

       where Lhs is called the left hand side, and Rhs1 and Rhs2 the first and second part of the
       right hand side.  Lhs, Rhs1 and Rhs2 are specified as typed feature structures.
       Example:

              Rule_1  : Noun[ Gender=feminine Number=singular ]
                      <- Noun[ Gender=feminine Number=singular ]
                         NounSuffix[ Gender=feminine ]

       Variables can be used to indicate that some attributes have the same value.  A variable is
       a name starting with a dollar (`$').
       Example:

              Rule_2  : Noun[ Gender=$A Number=$number ]
                      <- Noun[ Gender=$A Number=$number ]
                         NounSuffix[ Gender=$A ]

       If needed, both a variable and a value specification can be given for an  attribute  (only
       once per attribute):
       Example:

              Rule_3  : Noun[ Gender=$A Number=$number ]
                      <- Noun[ Gender=$A Number=$number ]
                         NounSuffix[ Gender=$A=masculine|neuter ]

       Affix rules define basic elements of the concatenations specified by binary rules
       (together with lexical entries, see the section @ Lexicon below).  An affix rule consists
       of a lexical string associated with a typed feature structure.
       Examples:

              Plural_s : "s" NounSuffix[ Number=plural ]
              Feminine_e : "e" NounSuffix[ Gender=feminine ]
              ing : "ing" VerbSuffix[ Tense=present_participle ]

       Goal  rules  specify the valid results constructed by the grammar.  They consist of just a
       typed feature structure.
       Examples:

              Goal_1  : Noun[]
              Goal_2  : Verb[ inflection=final ]

       In addition to these three basic rule types, there are prefix and suffix composite rules
       and unary rules.  A unary rule consists of a left hand side and a right hand side with a
       single element.
       Example:

              Rule_4  : Noun[ gender=$G number=plural ]
                      <- Noun[ gender=$G number=singular invariant=yes]

       Prefix and suffix composite rules have the same shape as binary rules except that one part
       of the right hand side is an affix (i.e. has an associated string).
       Examples:

              Append_e   : Noun[ Gender=feminine Number=$number ]
                      <- Noun[ Gender=feminine Number=$number ]
                         "e" NounSuffix[ Gender=feminine ]

              anti    : Noun[ Gender=$gender Number=$number ]
                      <- "anti" NounPrefix[]
                         Noun[ Gender=$gender Number=$number ]

@ Classes

       This optional section contains the definition of symbol classes. Each class is defined  as
       a  set  of symbols, or other classes. If the class contains only bi-level elements it is a
       bi-level class, otherwise it is a lexical or surface class.
       Examples:

              Dental : d t
              Vowel : a e i o u
              Vowel_y : Vowel y
              Consonant: b c d f g h j k l m n p q r s t v w x z

@ Pairs

       This optional section contains the definition of pair disjunctions.  Each  disjunction  is
       defined  as  a  set  of pairs.  Explicit pairs specify a sequence of surface symbols and a
       sequence of zero or one lexical symbol,  one  of  them  possibly  empty.   A  sequence  is
       enclosed  between  angle brackets `<' and `>'.  The empty sequence is indicated with `<>'.
       In the current implementation only the surface part of a pair can be a  sequence  of  more
       than  one  element.   The special symbol `?' stands for the class of all possible symbols,
       including the morpheme and word boundary.
       Examples:

              s_x_z_1 : s/s x/x z/z
              VowelPair1: a/a e/e i/i o/o u/u
              VowelPair2: Vowel/Vowel
              ie.y: <i e>/y
              Delete_e: <>/e
              Insert_d: d/<>
              Surface_Vowel: Vowel/?
              Lexical_s:  ?/s

              DoubleConsonant: <b b>/b <d d>/d <f f>/f <g g>/g <k k>/k  <m m>/m  <p p>/p  <s s>/s
                     <t t>/t <v v>/v <z z>/z

       Note  that  VowelPair1 and VowelPair2 don't specify the same thing: VowelPair2 would match
       a/o but VowelPair1 would not.
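
       The difference can be made concrete by enumerating the surface/lexical combinations each
       disjunction accepts.  This Python sketch is illustrative only; the variable names are
       hypothetical and pairs are modelled as (surface, lexical) tuples.

```python
VOWEL = {"a", "e", "i", "o", "u"}

# VowelPair1: five explicit pairs, each vowel over the same vowel.
vowel_pair_1 = {(v, v) for v in VOWEL}

# VowelPair2: the single pair Vowel/Vowel, i.e. any vowel over any vowel.
vowel_pair_2 = {(surf, lex) for surf in VOWEL for lex in VOWEL}

print(("a", "o") in vowel_pair_1)  # False
print(("a", "o") in vowel_pair_2)  # True
print(len(vowel_pair_1), len(vowel_pair_2))  # 5 25
```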

       Implicit pairs are specified by the name of a bi-level symbol or a bi-level class.
       Examples:  the following s_x_z_2 and VowelPair3 are equivalent to the  above  s_x_z_1  and
       VowelPair2 (assuming that s, x, z and Vowel are bi-level symbols and classes).

              s_x_z_2 : s x z
              VowelPair3 : Vowel

       In  a pair disjunction all lexical parts should be disjoint. This means you cannot specify
       for the same pair disjunction a/a and o/a or a/a and Vowel/Vowel.

       In a future version this section will be split in two:  simple pair disjunctions and  pair
       sequences.

@ Spelling

       In this section the two-level spelling rules are declared.  A spelling rule consists of a
       kind indicator followed by a left context, a focus and a right context.  The kind indicator
       is `=>' if the rule is optional, `<=>' if it is obligatory and `<=' if it is a surface
       coercion rule.  The contexts may be empty.  The focus is surrounded by two `-'.  The
       contexts and the focus consist of a sequence of pairs or pair disjunctions declared in the
       `@ Pairs' section.  A morpheme boundary is indicated by a `+' or a `*', a word boundary is
       indicated by a `~'.
       Examples:

              Sibilant_s: <=> s_x_z_1 * - e/<> - s
              Gemination: <=>
                      Consonant Vowel - DoubleConsonant - * Vowel
              i_y_optionnel: => a - i/y - * ?/e

       Constraints  may be specified in the form of a list of typed feature structures.  They are
       affix-driven:  the rule is  licensed  if  at  least  one  of  them  subsumes  the  closest
       corresponding  affix.   The  morpheme  boundary  indicated by a star (`*') will be used to
       determine which affix it is.  If there is no such indication, then the affix  adjacent  to
       the  morpheme  where the first character of the focus occurs is used.  In case there is no
       affix, the typed feature structure of the lexical stem is used.
       Example:

              Sibilant_s: <=>
                  s_x_z_1 * - e/<> - s NounSuffix[ Number=plural ]
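
       One way to read "subsumes" is sketched below in Python.  It is an assumption, not a
       description of the mmorph implementation: a constraint is taken to subsume an affix when
       both have the same type and, for every attribute the constraint mentions, the values the
       affix allows are a subset of those the constraint allows; an attribute the affix leaves
       unspecified is taken to stand for all its declared values.  All names are hypothetical,
       and licensed() only models the case where at least one constraint was given.

```python
def subsumes(general, specific, declared):
    """True if `general` is at least as permissive as `specific`."""
    g_type, g_atts = general
    s_type, s_atts = specific
    if g_type != s_type:
        return False
    for att, allowed in g_atts.items():
        # Assumption: an unspecified attribute stands for all declared values.
        s_values = s_atts.get(att, declared[att])
        if not s_values <= allowed:
            return False
    return True


def licensed(constraints, affix, declared):
    """The rule is licensed if at least one constraint subsumes the affix."""
    return any(subsumes(c, affix, declared) for c in constraints)


DECLARED = {
    "Number": {"singular", "plural"},
    "Gender": {"feminine", "masculine", "neuter"},
}
constraint = ("NounSuffix", {"Number": {"plural"}})
plural_s = ("NounSuffix", {"Number": {"plural"}})
feminine_e = ("NounSuffix", {"Gender": {"feminine"}})

print(licensed([constraint], plural_s, DECLARED))    # True
print(licensed([constraint], feminine_e, DECLARED))  # False
```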

@ Lexicon

       This section is optional and can also be repeated.  It lists all the lexical
       entries of the morphological description.  Unlike in the other sections, definitions do not
       have a name.  A definition consists of a typed feature structure followed by a list of
       lexical stems that share that feature structure.  A lexical stem consists of the string
       used in the concatenation specified by the grammar rules, followed by `=' and a reference
       string.  The reference string can be anything and is usually used to indicate the
       canonical form of the word or an identifier of an external database entry.
       Examples:

              Noun[ Number=singular ] "table" = "table" "chair" = "chair"
              Verb[ Transitive=yes|no Inflection=base ] "bow" = "bow1"
              Noun[ Number=singular ] "bow" = "bow2"

       If the stem string and the reference string are identical, only one needs to be
       specified.
       Example:

              Noun[ Number=singular ] "table" "chair"

FORMAL SYNTAX

       The  formal  syntax  description  below  is  in  Backus  Naur  Form  (BNF).  The following
       conventions apply:

       <id>      is a non-terminal symbol (within angle brackets).
       ID        is a token (terminal symbol, all uppercase).
       <id>?     means zero or one occurrence of <id> (i.e. <id> is optional).
       <id>*     is zero or more occurrences of <id>.
       <id>+     is one or more occurrences of <id>.
       ::=       separates a non-terminal symbol and its expansion.
       |         indicates an alternative expansion.
       ;         starts a comment (not part of the definition).

       The start symbol corresponding to a complete description is named <Start>.   Symbols  that
       parse but do nothing are marked with `; not operational'.

       <Start>           ::= <AlphabetDecl> <AttDecl> <TypeDecl> <GramDecl>
                             <ClassDecl>? <PairDecl>? <SpellDecl>? <LexDecl>*

       <AlphabetDecl>    ::= ALPHABETS <LexicalDef> <SurfaceDef>

       <LexicalDef>      ::= <LexicalName> COLON <LexicalSymbol>+

       <SurfaceDef>      ::= <SurfaceName> COLON <SurfaceSymbol>+

       <LexicalSymbol>   ::= <LexicalSymbolName>    ; lexical only
                         |   <BiLevelSymbolName>    ; both lexical and surface

       <SurfaceSymbol>   ::= <SurfaceSymbolName>    ; surface only
                         |   <BiLevelSymbolName>    ; both lexical and surface

       <AttDecl>         ::= ATTRIBUTES <AttDef>+

       <AttDef>          ::= <AttName> COLON <ValName>+

       <TypeDecl>        ::= TYPES <TypeDef>+

       <TypeDef>         ::= <TypeName> COLON <AttName>+ <NoProjAtt>?

       <NoProjAtt>       ::= BAR <AttName>+

       <LexDecl>         ::= LEXICON <LexDef>+

       <LexDef>          ::= <Tfs> <Lexical>+

       <Lexical>         ::= LEXICALSTRING <BaseForm>?

       <BaseForm>        ::= EQUAL LEXICALSTRING

       <Tfs>             ::= <TypeName> <AttSpec>?

       <VarTfs>          ::= <TypeName> <VarAttSpec>?

       <AttSpec>         ::= LBRA <AttVal>* RBRA

       <VarAttSpec>      ::= LBRA <VarAttVal>* RBRA

       <AttVal>          ::= <AttName> <ValSpec>

       <VarAttVal>       ::= <AttName> <VarValSpec>

       <ValSpec>         ::= EQUAL <ValSet>
                         |   NOTEQUAL <ValSet>

       <VarValSpec>      ::= <ValSpec>
                         |   EQUAL DOLLAR <VarName>
                         |   EQUAL DOLLAR <VarName> <ValSpec>

       <ValSet>          ::= <ValName> <ValSetRest>*

       <ValSetRest>      ::= BAR <ValName>

       <GramDecl>        ::= GRAMMAR <RuleDef>+

       <RuleDef>         ::= <RuleName> COLON <RuleBody>

       <RuleBody>        ::= <VarTfs> LARROW <Rhs>
                         |   <Tfs>    ; goal rule
                         |   LEXICALSTRING <Tfs>    ; lexical affix

       <Rhs>             ::= <VarTfs>    ; unary rule
                         |   <VarTfs> <VarTfs>    ; binary rule
                         |   LEXICALSTRING <Tfs> <VarTfs>   ; prefix rule
                         |   <VarTfs> LEXICALSTRING <Tfs>    ; suffix rule

       <ClassDecl>       ::= CLASSES <ClassDef>+

       <ClassDef>        ::= <LexicalClassName> COLON <LexicalClass>+
                         |   <SurfaceClassName> COLON <SurfaceClass>+
                         |   <BiLevelClassName> COLON <BiLevelClass>+

       <LexicalClass>    ::= <LexicalSymbol>
                         |   <LexicalClassName>
                         |   <BiLevelClassName>

       <SurfaceClass>    ::= <SurfaceSymbol>
                         |   <SurfaceClassName>
                         |   <BiLevelClassName>

       <BiLevelClass>    ::= <BiLevelSymbolName>
                         |   <BiLevelClassName>

       <PairDecl>        ::= PAIRS <PairDef>+

       <PairDef>         ::= <PairName> COLON <Pair>+

       <Pair>            ::= <SurfaceSequence> SLASH <LexicalSequence>
                         |   <PairName>
                         |   <BiLevelClassName>
                         |   <BiLevelSymbolName>

       <SurfaceSequence> ::= LANGLE <SurfaceSymbol>* RANGLE
                         |   SURFACESTRING
                         |   <SurfaceClass>
                         |   ANY

       <LexicalSequence> ::= LANGLE <LexicalSymbol>* RANGLE
                         |   LEXICALSTRING
                         |   <LexicalClass>
                         |   ANY

       <SpellDecl>       ::= SPELLING <SpellDef>+

       <SpellDef>        ::= <SpellName> COLON <Arrow> <LeftContext> <Focus>
                                 <RightContext> <Constraint>*

       <LeftContext>     ::= <Pattern>*

       <RightContext>    ::= <Pattern>*

       <Focus>           ::= CONTEXTBOUNDARY <Pattern>+ CONTEXTBOUNDARY

       <Pattern>         ::= <Pair>
                         |   MORPHEMEBOUNDARY
                         |   WORDBOUNDARY
                         |   CONCATBOUNDARY

       <Constraint>      ::= <Tfs>

       <Arrow>           ::= RARROW
                         |   BIARROW
                         |   COERCEARROW

       <AttName>           ::= NAME
       <BiLevelClassName>  ::= NAME
       <BiLevelSymbolName> ::= NAME  | SYMBOLSTRING
       <LexicalClassName>  ::= NAME
       <LexicalName>       ::= NAME
       <LexicalSymbolName> ::= NAME  | SYMBOLSTRING
       <PairName>          ::= NAME
       <RuleName>          ::= NAME
       <SpellName>         ::= NAME
       <SurfaceClassName>  ::= NAME
       <SurfaceName>       ::= NAME
       <SurfaceSymbolName> ::= NAME  | SYMBOLSTRING
       <TypeName>          ::= NAME
       <ValName>           ::= NAME
       <VarName>           ::= NAME

   Simple tokens
       Simple tokens of the BNF above are defined as follows.  The token name on the left
       corresponds to the literal character or characters on the right:

       ANY                 ?
       BAR                 |
       BIARROW             <=>
       COERCEARROW         <=
       COLON               :
       CONCATBOUNDARY      *
       CONTEXTBOUNDARY     -
       DOLLAR              $
       EQUAL               =
       LANGLE              <
       LARROW              <-
       LBRA                [
       MORPHEMEBOUNDARY    +
       NOTEQUAL            !=
       RARROW              =>
       RANGLE              >
       RBRA                ]
       SLASH               /
       WORDBOUNDARY        ~

       ALPHABETS           @Alphabets
       ATTRIBUTES          @Attributes
       CLASSES             @Classes
       GRAMMAR             @Grammar
       LEXICON             @Lexicon
       PAIRS               @Pairs
       SPELLING            @Spelling
       TYPES               @Types

       In the section header tokens above, spaces may separate the `@' from the reserved word.

   Complex tokens
       NAME
              is any sequence of letters, digits, underscores (`_') and periods (`.').
              Examples:
              category
              33
              Rule_9
              __2__
              Proper.Noun

       LEXICALSTRING
              is a string of lexical symbols

       SURFACESTRING
              is a string of surface symbols

       SYMBOLSTRING
              is a string of just one character (used only in alphabet declarations).
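
       The NAME token can be approximated with a regular expression.  The Python sketch below is
       an illustration only: it assumes that "letter" and "digit" mean ASCII letters and digits,
       which the description above does not state, and the name is_name is hypothetical.

```python
import re

# Assumption: only ASCII letters and digits count, plus `_' and `.'.
NAME = re.compile(r"[A-Za-z0-9_.]+")


def is_name(text):
    """True if the whole text is a sequence of NAME characters."""
    return NAME.fullmatch(text) is not None


for candidate in ["category", "33", "Rule_9", "__2__", "Proper.Noun", "strong-e"]:
    print(candidate, is_name(candidate))
```

       With this pattern the five examples listed for NAME are accepted, while strong-e is
       rejected because of the hyphen.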

       A string consists of zero or more characters within double quotes (`"').  Characters
       preceded by a backslash (`\') are escaped (the usual C escaping conventions apply).
       Symbols that have a name longer than one character are represented using an SGML
       entity-like notation: `&symbolname;'.  The maximum number of symbols in a string is 127.
       Examples:

              "table"
              ","
              ""
              "double quote is \" and backslash is \\"
              "&strong_e;"
              "escape like in C : \t is ASCII tab"
              "escape with octal code: \011 is ASCII tab"

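       The string conventions above can be sketched as a small decoder that turns the text
       between the double quotes into a list of symbols.  This Python code is an illustration
       under assumptions, not the mmorph lexer: it handles only the `\n', `\t', `\"', `\\' and
       octal escapes, passes any other escaped character through unchanged, and the names
       split_symbols and SIMPLE_ESCAPES are hypothetical.

```python
import re

MAX_SYMBOLS = 127
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}
TOKEN = re.compile(
    r"&(?P<entity>[A-Za-z0-9_.]+);"   # &symbolname; is one multi-character symbol
    r"|\\(?P<octal>[0-7]{1,3})"       # octal escape such as \011
    r"|\\(?P<escape>.)"               # single-character escape such as \t or \"
    r"|(?P<plain>[^\\])",             # any other single character
    re.DOTALL,
)


def split_symbols(body):
    """Split the text between the double quotes into a list of symbols."""
    symbols = []
    pos = 0
    while pos < len(body):
        m = TOKEN.match(body, pos)
        if m is None:
            raise ValueError("dangling backslash at end of string")
        if m.group("entity") is not None:
            symbols.append(m.group("entity"))
        elif m.group("octal") is not None:
            symbols.append(chr(int(m.group("octal"), 8)))
        elif m.group("escape") is not None:
            esc = m.group("escape")
            symbols.append(SIMPLE_ESCAPES.get(esc, esc))
        else:
            symbols.append(m.group("plain"))
        pos = m.end()
    if len(symbols) > MAX_SYMBOLS:
        raise ValueError("more than 127 symbols in a string")
    return symbols


print(split_symbols("table"))        # ['t', 'a', 'b', 'l', 'e']
print(split_symbols("&strong_e;"))   # ['strong_e']
```
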
       Tokens can be separated by one or more blanks or comments.
       A blank separator is a space, tab or newline.
       A comment starts with a semicolon and finishes at the next newline (except when the
       semicolon occurs in a string).
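
       The comment rule can be sketched as follows.  This Python function is illustrative only
       and its name is hypothetical; it assumes that a string never spans more than one line.

```python
def strip_comment(line):
    """Remove a `;' comment, ignoring semicolons inside double-quoted strings."""
    in_string = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_string and ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == '"':
            in_string = not in_string
        elif ch == ";" and not in_string:
            return line[:i]
        i += 1
    return line


print(strip_comment('Surface : ";" ":" ; punctuation symbols'))
```

       In the sample line the first semicolon is kept because it sits inside a string; only the
       text from the second semicolon onwards is dropped.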

       Inclusion of files can be specified with the usual `#include' directive.
       Example:

              #include "verb.entries"

       This splices in the content of the file verb.entries at the point where the directive
       occurs.

       The  `#'  should  be the first character on the line.  Tabs or spaces may separate `#' and
       `include'.  The file name must be quoted.  Only tabs or spaces may occur on  the  rest  of
       the line.  Inclusion can be nested up to 10 levels.
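
       These rules can be sketched as a recursive expansion.  The Python code below is an
       illustration, not the mmorph preprocessor: files are modelled as an in-memory dictionary,
       the names INCLUDE and expand are hypothetical, and the 10 levels are assumed to be counted
       from the first included file (the top-level file being level 0).

```python
import re

# `#' in column one, optional tabs/spaces, `include', a quoted file name,
# then nothing but tabs or spaces up to the end of the line.
INCLUDE = re.compile(r'^#[ \t]*include[ \t]*"([^"]*)"[ \t]*$')
MAX_DEPTH = 10


def expand(name, files, depth=0):
    """Return the lines of `name` with #include directives spliced in."""
    if depth > MAX_DEPTH:
        raise RecursionError("inclusion nested deeper than 10 levels")
    result = []
    for line in files[name].splitlines():
        m = INCLUDE.match(line)
        if m:
            result.extend(expand(m.group(1), files, depth + 1))
        else:
            result.append(line)
    return result


files = {
    "main": '@ Lexicon\n#include "verb.entries"\n',
    "verb.entries": 'Verb[ Inflection=base ] "bow" = "bow1"\n',
}
expanded = expand("main", files)
print("\n".join(expanded))
```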

AUTHOR

       Dominique Petitpierre, ISSCO, <petitp@divsun.unige.ch>

COMMENTS

       The parser for the morphology description formalism above was written using yacc(1) and
       flex(1).  Flex was written by Vern Paxson, <vern@ee.lbl.gov>, and is distributed in the
       framework of the GNU project under the conditions of the GNU General Public License.

                                    Version 2.3, October 1995                           MMORPH(5)
