Provided by: tcllib_1.20+dfsg-1_all bug

NAME

       grammar::peg - Create and manipulate parsing expression grammars

SYNOPSIS

       package require Tcl  8.4

       package require snit

       package require grammar::peg  ?0.1?

       ::grammar::peg pegName ?=|:=|<--|as|deserialize src?

       pegName destroy

       pegName clear

       pegName = srcPEG

       pegName --> dstPEG

       pegName serialize

       pegName deserialize serialization

       pegName is valid

       pegName start ?pe?

       pegName nonterminals

       pegName nonterminal add nt pe

       pegName nonterminal delete nt1 ?nt2 ...?

       pegName nonterminal exists nt

       pegName nonterminal rename nt ntnew

       pegName nonterminal mode nt ?mode?

       pegName nonterminal rule nt

       pegName unknown nonterminals

_________________________________________________________________________________________________

DESCRIPTION

       This  package provides a container class for parsing expression grammars (Short: PEG).  It
       allows the incremental definition of the grammar, its manipulation  and  querying  of  the
       definition.   The  package  neither provides complex operations on the grammar, nor has it
       the ability to execute a grammar definition for a stream of symbols.  Two packages related
       to  this one are grammar::mengine and grammar::peg::interpreter. The first of them defines
       a general virtual machine  for  the  matching  of  a  character  stream,  and  the  second
       implements an interpreter for parsing expression grammars on top of that virtual machine.

   TERMS & CONCEPTS
       PEGs  are  similar  to  context-free  grammars, but not equivalent; in some cases PEGs are
       strictly more powerful than context-free grammars (there exist PEGs for some  non-context-
       free  languages).   The  formal mathematical definition of parsing expressions and parsing
       expression grammars can be found in section PARSING EXPRESSION GRAMMARS.

       In short, we have  terminal  symbols,  which  are  the  most  basic  building  blocks  for
       sentences,  and  nonterminal  symbols  with  associated  parsing expressions, defining the
       grammatical structure of the sentences. The two sets of symbols are  distinctive,  and  do
       not overlap. When speaking about symbols the word "symbol" is often left out. The union of
       the sets of terminal and nonterminal symbols is called the set of symbols.

       Here the set of terminal symbols is not explicitly managed, but implicitly defined as  the
       set of all characters. Note that this means that we inherit from Tcl the ability to handle
       all of Unicode.

       A pair of nonterminal and parsing expression is also called a grammatical  rule,  or  rule
       for  short.  In  the  context of a rule the nonterminal is often called the left-hand-side
       (LHS), and the parsing expression the right-hand-side (RHS).

       The start expression of a grammar is a parsing expression from  which  all  the  sentences
       contained in the language specified by the grammar are derived.  To make the understanding
       of this term easier let us assume for a moment that the RHS of each rule,  and  the  start
       expression, is either a sequence of symbols, or a series of alternate parsing expressions.
       In the latter case the rule can be seen as a set of rules, each providing one  alternative
       for  the nonterminal.  A parsing expression A' is now a derivation of a parsing expression
       A if we pick one of the nonterminals N in the expression, and one of the alternative rules
       R  for  N,  and then replace the nonterminal in A with the RHS of the chosen rule. Here we
       can see why the terminal symbols are called such. They cannot  be  expanded  any  further,
       thus terminate the process of deriving new expressions.  An example

                  Rules
                    (1)  A <- a B c
                    (2a) B <- d B
                    (2b) B <- e

                  Some derivations, using starting expression A.

                    A -/1/-> a B c -/2a/-> a d B c -/2b/-> a d e c

       A  derived  expression  containing  only  terminal  symbols  is a sentence. The set of all
       sentences which can be derived from the start expression is the language of the grammar.

       Some definitions for nonterminals and expressions:

       [1]    A nonterminal A is  called  reachable  if  it  is  possible  to  derive  a  parsing
              expression from the start expression which contains A.

       [2]    A nonterminal A is called useful if it is possible to derive a sentence from it.

       [3]    A  nonterminal  A  is  called  recursive  if  it  is  possible  to derive a parsing
              expression from it which contains A, again.

       [4]    The FIRST set of a nonterminal A contains all the symbols which can occur of as the
              leftmost symbol in a parsing expression derived from A. If the FIRST set contains A
              itself then that nonterminal is called left-recursive.

       [5]    The LAST set of a nonterminal A contains all the symbols which can occur of as  the
              rightmost symbol in a parsing expression derived from A. If the LAST set contains A
              itself then that nonterminal is called right-recursive.

       [6]    The FOLLOW set of a nonterminal A contains all the symbols which can occur after  A
              in a parsing expression derived from the start expression.

       [7]    A  nonterminal (or parsing expression) is called nullable if the empty sentence can
              be derived from it.

       And based on the above definitions for grammars:

       [1]    A grammar G is recursive if and only if  it  contains  a  nonterminal  A  which  is
              recursive. The terms left- and right-recursive, and useful are analogously defined.

       [2]    A grammar is minimal if it contains only reachable and useful nonterminals.

       [3]    A  grammar  is  wellformed  if  it  is  not  left-recursive. Such grammars are also
              complete, which means that they always succeed or fail on all input sentences.  For
              an  incomplete grammar on the other hand input sentences exist for which an attempt
              to match them against the grammar will not terminate.

       [4]    As we wish to allow ourselves to build  a  grammar  incrementally  in  a  container
              object  we  will  encounter  stages  where  the  RHS of one or more rules reference
              symbols which are not yet known to the container. Such a grammar we  call  invalid.
              We cannot use the term incomplete as this term is already taken, see the last item.

   CONTAINER CLASS API
       The package exports the API described here.

       ::grammar::peg pegName ?=|:=|<--|as|deserialize src?
              The  command  creates  a  new container object for a parsing expression grammar and
              returns the fully qualified name of the object command as its result. The  API  the
              returned  command is following is described in the section CONTAINER OBJECT API. It
              may be used to invoke various operations on the container and the grammar within.

              The new container, i.e. grammar will be empty if no src is specified. Otherwise  it
              will  contain  a  copy  of  the  grammar contained in the src.  The src has to be a
              container object reference for all operators except deserialize.   The  deserialize
              operator  requires  src  to  be  the  serialization of a parsing expression grammar
              instead.

              An empty grammar has no nonterminal symbols, and the start expression is the  empty
              expression, i.e. epsilon. It is valid, but not useful.

   CONTAINER OBJECT API
       All  grammar container objects provide the following methods for the manipulation of their
       contents:

       pegName destroy
              Destroys the grammar, including its storage space and associated command.

       pegName clear
              Clears out the definition of the grammar contained in pegName, but does not destroy
              the object.

       pegName = srcPEG
              Assigns the contents of the grammar contained in srcPEG to pegName, overwriting any
              existing definition.  This is the assignment operator for grammars. It  copies  the
              grammar  contained  in  the  grammar  object  srcPEG over the grammar definition in
              pegName. The old contents of pegName are deleted by this operation.

              This operation is in effect equivalent to

                  pegName deserialize [srcPEG serialize]

       pegName --> dstPEG
              This is the reverse assignment operator for  grammars.  It  copies  the  automation
              contained  in  the object pegName over the grammar definition in the object dstPEG.
              The old contents of dstPEG are deleted by this operation.

              This operation is in effect equivalent to

                  dstPEG deserialize [pegName serialize]

       pegName serialize
              This method serializes the grammar stored in pegName. In other words it  returns  a
              tcl  value  completely  describing  that  grammar.   This  allows, for example, the
              transfer of grammars over arbitrary channels, persistence,  etc.   This  method  is
              also the basis for both the copy constructor and the assignment operator.

              The result of this method has to be semantically identical over all implementations
              of the grammar::peg interface. This is what will enable us to copy grammars between
              different implementations of the same interface.

              The result is a list of four elements with the following structure:

              [1]    The constant string grammar::peg.

              [2]    A  dictionary.  Its keys are the names of all known nonterminal symbols, and
                     their  associated  values  are  the  parsing  expressions  describing  their
                     sentennial structure.

              [3]    A  dictionary.  Its keys are the names of all known nonterminal symbols, and
                     their associated values hints to a matcher  regarding  the  semantic  values
                     produced by the symbol.

              [4]    The last item is a parsing expression, the start expression of the grammar.

       Assuming the following PEG for simple mathematical expressions

                  Digit      <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9'
                  Sign       <- '+' / '-'
                  Number     <- Sign? Digit+
                  Expression <- '(' Expression ')' / (Factor (MulOp Factor)*)
                  MulOp      <- '*' / '/'
                  Factor     <- Term (AddOp Term)*
                  AddOp      <- '+'/'-'
                  Term       <- Number

       a possible serialization is

                  grammar::peg \
                  {Expression {/ {x ( Expression )} {x Factor {* {x MulOp Factor}}}} \
                   Factor     {x Term {* {x AddOp Term}}} \
                   Term       Number \
                   MulOp      {/ * /} \
                   AddOp      {/ + -} \
                   Number     {x {? Sign} {+ Digit}} \
                   Sign       {/ + -} \
                   Digit      {/ 0 1 2 3 4 5 6 7 8 9} \
                  } \
                  {Expression value     Factor     value \
                   Term       value     MulOp      value \
                   AddOp      value     Number     value \
                   Sign       value     Digit      value \
                  }
                  Expression

       A possible one, because the order of the nonterminals in the dictionary is not relevant.

       pegName deserialize serialization
              This  is the complement to serialize. It replaces the grammar definition in pegName
              with the grammar described by the serialization value. The old contents of  pegName
              are deleted by this operation.

       pegName is valid
              A  predicate.  It  tests  whether the PEG in pegName is valid.  See section TERMS &
              CONCEPTS for the definition of this grammar property.   The  result  is  a  boolean
              value.  It  will  be  set  to  true  if  the PEG has the tested property, and false
              otherwise.

       pegName start ?pe?
              This method defines the start expression of the grammar. It replaces the previously
              defined  start  expression  with  the  parsing expression pe.  The method fails and
              throws an error if pe does not contain a valid parsing expression as  specified  in
              the  section PARSING EXPRESSIONS. In that case the existing start expression is not
              changed.  The method returns the empty string as its result.

              If the method is called without an argument it will return  the  currently  defined
              start expression.

       pegName nonterminals
              Returns the set of all nonterminal symbols known to the grammar.

       pegName nonterminal add nt pe
              This method adds the nonterminal nt and its associated parsing expression pe to the
              set of nonterminal symbols and rules of the PEG contained in  the  object  pegName.
              The  method fails and throws an error if either the string nt is already known as a
              symbol of the grammar, or if pe does not contain  a  valid  parsing  expression  as
              specified  in  the  section  PARSING  EXPRESSIONS.  In that case the current set of
              nonterminal symbols and rules is not changed.  The method returns the empty  string
              as its result.

       pegName nonterminal delete nt1 ?nt2 ...?
              This  method removes the named symbols nt1, nt2 from the set of nonterminal symbols
              of the PEG contained in the object pegName.  The method fails and throws  an  error
              if  any  of  the  strings  is  not  known as a nonterminal symbol. In that case the
              current set of nonterminal symbols is not changed.  The method  returns  the  empty
              string as its result.

              The  stored  grammar  becomes invalid if the deleted nonterminals are referenced by
              the RHS of still-known rules.

       pegName nonterminal exists nt
              A predicate. It tests whether the nonterminal symbol nt is  known  to  the  PEG  in
              pegName.  The result is a boolean value. It will be set to true if the symbol nt is
              known, and false otherwise.

       pegName nonterminal rename nt ntnew
              This method renames the nonterminal symbol nt  to  ntnew.   The  method  fails  and
              throws  an error if either nt is not known as a nonterminal, or if ntnew is a known
              symbol.  The method returns the empty string as its result.

       pegName nonterminal mode nt ?mode?
              This mode returns or sets the semantic mode associated with the nonterminal  symbol
              nt.  If  no  mode  is  specified  the  current mode of the nonterminal is returned.
              Otherwise the current mode is set to mode.  The method fails and throws an error if
              nt  is  not  known  as  a  nonterminal.  The grammar interpreter implemented by the
              package grammar::peg::interpreter recognizes the following modes:

              value  The semantic value of the nonterminal is the abstract  syntax  tree  created
                     from the AST's of the RHS and a node for the nonterminal itself.

              match  The  semantic  value  of  the  nonterminal  is  an  the abstract syntax tree
                     consisting of single a node for the string matched  by  the  RHS.  The  ASTs
                     generated by the RHS are discarded.

              leaf   The  semantic  value  of  the  nonterminal  is  an  the abstract syntax tree
                     consisting of single a node for the nonterminal itself. The  ASTs  generated
                     by the RHS are discarded.

              discard
                     The  nonterminal  has  no  semantic value. The ASTs generated by the RHS are
                     discarded (as well).

       pegName nonterminal rule nt
              This method returns the parsing expression associated with the nonterminal nt.  The
              method fails and throws an error if nt is not known as a nonterminal.

       pegName unknown nonterminals
              This  method  returns  a list containing the names of all nonterminal symbols which
              are referenced on the RHS of a grammatical rule, but have no rule definining  their
              structure. In other words, a list of the nonterminal symbols which make the grammar
              invalid. The grammar is valid if this list is empty.

   PARSING EXPRESSIONS
       Various methods of PEG container objects expect a parsing expression as their argument, or
       will return such. This section specifies the format such parsing expressions are in.

       [1]    The string epsilon is an atomic parsing expression. It matches the empty string.

       [2]    The  string  alnum  is  an  atomic  parsing expression. It matches any alphanumeric
              character.

       [3]    The string alpha is an atomic  parsing  expression.  It  matches  any  alphabetical
              character.

       [4]    The string dot is an atomic parsing expression. It matches any character.

       [5]    The  expression [list t x] is an atomic parsing expression. It matches the terminal
              string x.

       [6]    The expression [list  n  A]  is  an  atomic  parsing  expression.  It  matches  the
              nonterminal A.

       [7]    For  parsing expressions e1, e2, ... the result of [list / e1 e2 ... ] is a parsing
              expression as well.  This is the ordered choice, aka prioritized choice.

       [8]    For parsing expressions e1, e2, ... the result of [list x e1 e2 ... ] is a  parsing
              expression as well.  This is the sequence.

       [9]    For  a  parsing  expression  e  the result of [list * e] is a parsing expression as
              well.  This is the kleene closure, describing zero or more repetitions.

       [10]   For a parsing expression e the result of [list + e]  is  a  parsing  expression  as
              well.  This is the positive kleene closure, describing one or more repetitions.

       [11]   For  a  parsing  expression  e  the result of [list & e] is a parsing expression as
              well.  This is the and lookahead predicate.

       [12]   For a parsing expression e the result of [list ! e]  is  a  parsing  expression  as
              well.  This is the not lookahead predicate.

       [13]   For  a  parsing  expression  e  the result of [list ? e] is a parsing expression as
              well.  This is the optional input.

       Examples of parsing expressions where already shown, in  the  description  of  the  method
       serialize.

PARSING EXPRESSION GRAMMARS

       For the mathematically inclined, a PEG is a 4-tuple (VN,VT,R,eS) where

       •      VN is a set of nonterminal symbols,

       •      VT is a set of terminal symbols,

       •      R  is  a  finite  set  of  rules, where each rule is a pair (A,e), A in VN, and e a
              parsing expression.

       •      eS is a parsing expression, the start expression.

       Further constraints are

       •      The intersection of VN and VT is empty.

       •      For all A in VT exists exactly one pair (A,e) in R. In other words, R is a function
              from nonterminal symbols to parsing expressions.

       Parsing expression are inductively defined via

       •      The empty string (epsilon) is a parsing expression.

       •      A terminal symbol a is a parsing expression.

       •      A nonterminal symbol A is a parsing expression.

       •      e1e2  is  a  parsing  expression  for  parsing expressions e1 and 2. This is called
              sequence.

       •      e1/e2 is a parsing expression for parsing expressions e1  and  2.  This  is  called
              ordered choice.

       •      e*  is  a  parsing expression for parsing expression e. This is called zero-or-more
              repetitions, also known as kleene closure.

       •      e+ is a parsing expression for parsing expression e.  This  is  called  one-or-more
              repetitions, also known as positive kleene closure.

       •      !e  is  a  parsing  expression  for  parsing  expression  e1.  This is called a not
              lookahead predicate.

       •      &e is a parsing expression for  parsing  expression  e1.  This  is  called  an  and
              lookahead predicate.

       PEGs are used to define a grammatical structure for streams of symbols over VT. They are a
       modern phrasing of older formalisms invented by Alexander Birham.  These  formalisms  were
       called  TS  (TMG recognition scheme), and gTS (generalized TS). Later they were renamed to
       TPDL (Top-Down Parsing Languages) and gTPDL (generalized TPDL).

       They can be easily implemented by recursive descent parsers with backtracking. This  makes
       them relatives of LL(k) Context-Free Grammars.

REFERENCES

       [1]    The     Packrat     Parsing     and     Parsing     Expression     Grammars    Page
              [http://www.pdos.lcs.mit.edu/~baford/packrat/],  by   Bryan   Ford,   Massachusetts
              Institute of Technology. This is the main entry page to PEGs, and their realization
              through Packrat Parsers.

       [2]    Parsing Techniques - A Practical Guide  [http://www.cs.vu.nl/~dick/PTAPG.html],  an
              online book offering a clear, accessible, and thorough discussion of many different
              parsing techniques with their interrelations and applicabilities,  including  error
              recovery techniques.

       [3]    Compilers  and  Compiler  Generators [http://scifac.ru.ac.za/compilers/], an online
              book using CoCo/R, a generator for recursive descent parsers.

BUGS, IDEAS, FEEDBACK

       This document, and the package it describes,  will  undoubtedly  contain  bugs  and  other
       problems.   Please  report  such  in  the  category  grammar_peg  of  the  Tcllib Trackers
       [http://core.tcl.tk/tcllib/reportlist].  Please also report any ideas for enhancements you
       may have for either package and/or documentation.

       When proposing code changes, please provide unified diffs, i.e the output of diff -u.

       Note further that attachments are strongly preferred over inlined patches. Attachments can
       be made by going to the Edit form of the ticket immediately after its creation,  and  then
       using the left-most button in the secondary navigation bar.

KEYWORDS

       LL(k),  TDPL,  context-free  languages,  expression, grammar, parsing, parsing expression,
       parsing expression grammar,  push  down  automaton,  recursive  descent,  state,  top-down
       parsing languages, transducer

CATEGORY

       Grammars and finite automata

COPYRIGHT

       Copyright (c) 2005 Andreas Kupries <andreas_kupries@users.sourceforge.net>