Provided by: libmarpa-r2-perl_2.086000~dfsg-6build2_amd64 bug

NAME

       Marpa::R2::Vocabulary - Standard parsing terms as used within Marpa

Description

       The definitions in this document are of standard parsing terms as they are used in the
       Marpa context.  I put defining uses of terms in boldface, for easy skimming.  Where a
       narrow or specialized sense of the term is the one that applies within Marpa, that is the
       only definition given.  Marpa also sometimes uses a standard term with a definition which
       is slightly different from the standard one.  ("Ambiguous grammar" is one example, and
       "grammar" itself is another.)  When this is the case, it is explicitly pointed out.

       A reader totally new to parsing will find this document too terse to act as a textbook or
       tutorial.  They will need to look elsewhere first.  As an introduction, I recommend Mark
       Jason Dominus's excellent chapter on parsing in the Perl context.  It's available on-line.
       Wikipedia is also an excellent place to start.

   Basic
       A grammar is a set of rules, associated with a set of symbols, one of which is
       distinguished as the start symbol.  A symbol string, or simply string where the meaning is
       clear, is an ordered series of symbols.  The length of a string is the number of symbols
       in it.

       A language is a set of symbol strings.  A grammar defines a language, as will be described
       later.

       It is important to note that the term language, as it is used in parsing theory, means
       something very different from what it means in ordinary use.  The meaning of the strings
       is an essential part of the ordinary idea of what a language is.  In ordinary use, the
       word "language" means far more than a unordered list of its sentences.  In parsing
       terminology, meaning (or semantics as it is called) is a separate issue.  For parsing
       theory a language is exactly a set of strings -- that and nothing more.

       The Marpa definition of a grammar differs slightly from the various standard ones.
       Standard definitions usually sharply distinguish terminal symbols from non-terminals.
       Marpa does not.

   Stages of parsing
       A recognizer is a program that determines whether its input is in the language of a
       grammar and a start symbol.  A parser is a program which finds the structure of that
       input.

       The term parsing is used in a strict and a loose sense.  Parsing in the loose sense is all
       phases of finding a grammar's structure, including a separate recognition phase if the
       parser has one.  (Marpa does.)  If a parser has phases, parsing in the strict sense refers
       specifically to the phase that finds the structure of the input.  When the Marpa documents
       use the term parsing in its strict sense, they will speak explicitly of "parsing in the
       strict sense".  Otherwise, parsing will mean parsing in the loose sense.

       Parsers often use a lexical analyzer to convert raw input, usually input text, into a
       token stream, which is a series of tokens.  Each token represents a symbol of the grammar
       and has a value.  A lexical analyzer is often called a lexer or a scanner, and lexical
       analysis is often called lexing or scanning.

       The series of symbols represented by the series of tokens becomes the symbol string input
       seen by the recognizer.  The symbol string input is more often called the input sentence.

       By default, Marpa uses the token stream model of input.  Marpa also allows alternative
       input models.  These are new to Marpa, so that their terminology is of necessity non-
       standard.  The terminology needed for alternative input models is explained in the
       document that introduces them.

   Rules
       A standard way of describing rules is Backus-Naur Form, or BNF.  A rule of a grammar is
       also often called a production.  In one common way of writing BNF, a production looks like
       this:

           Expression ::= Term Factor

       In the production above, "Expression", "Term" and "Factor" are symbols.  A production
       consists of a left hand side and a right hand side.  In a context-free grammar, like those
       Marpa parses, the left hand side of a production is always a symbol string of length 1.
       The right hand side of a production is a symbol string of zero or more symbols.  In the
       example, "Expression" is the left hand side, and "Term" and "Factor" are right hand side
       symbols.

       Left hand side and right hand side are often abbreviated as RHS and LHS.  If the RHS of a
       production has no symbols, the production is called an empty production or an empty rule.

       Any symbol which is allowed to occur in the symbol string input is called a terminal
       symbol.  If the symbols in a symbol string are all terminals, that symbol string is also
       called a sentence.

   Derivations
       A step of a derivation, or derivation step, is a change made to a symbol string by
       applying one of the productions from the grammar.  The production must be one of those
       with a LHS that occurs in the symbol string.  The result of the derivation step is another
       symbol string, one in which every occurence of the LHS symbol from the production is
       replaced by the RHS of the production.  For example, if "A", "B", "C", "D", and "X" are
       symbols, and

           X ::= B C

       is a production, then

           A X D -> A B C D

       is a derivation step, with ""A X D"" as its beginning and ""A B C D"" as its end or
       result.  We say that the symbol string ""A X D"" derives the symbol string ""A B C D"".

       A derivation is a sequence of derivation steps.  The length of a derivation is its length
       in steps.

       •   We say that a first symbol string directly derives a second symbol string if and only
           if there is a derivation of length 1 from the first symbol string to the second symbol
           string.

       •   Every symbol string is said to derive itself in a derivation of length 0.  A zero
           length derivation is a trivial derivation.

       •   A derivation which is not trivial (that is, a derivation which has one or more steps)
           is a non-trivial derivation.

       •   A non-trivial derivation of a symbol string from itself is called a cycle.  Grammars
           which contain cycles are traditionally considered useless, but Marpa will parse with
           such grammars.

       •   If a derivation is not trivial or direct, that is, if it has more than one step, then
           it is an indirect derivation.

       Technically, a symbol "X" and a string that consists of only that symbol are two different
       things.  But we often say "the symbol "X"" as shorthand for "the string of length 1 whose
       only symbol is "X"".  For example, if the string containing only the symbol "X" derives a
       string "Y", we will usually say simply that ""X" derives "Y"".

       Wherever symbol or string "X" derives "Y", we may also say "X" produces "Y".  Derivations
       are often described as symbol matches.  Wherever symbol or string "X" derives "Y", we may
       also say that "Y" matches "X" or that "X" matches "Y".  It is particularly common to say
       that "X" matches "Y" when "X" or "Y" is a sentence.

       The parse of an input by a grammar is successful if and only if, according to the grammar,
       the start symbol produces the input sentence.  The set of all input sentences that a
       grammar will successfully parse is the language of the grammar.

   Nulling
       The zero length symbol string is called the empty string.  The empty string can be
       considered to be a sentence, in which case it is the empty sentence.  A string of one or
       more symbols is non-empty.  A derivation which produces the empty string is a null
       derivation.  A derivation from the start symbol which produces the empty string is a null
       parse.

       If in a particular grammar, a symbol has a null derivation, it is a nullable symbol.  If,
       in a particular grammar, the only sentence produced by a symbol is the empty sentence, it
       is a nulling symbol.  All nulling symbols are nullable symbols.

       If a symbol is not nullable, it is non-nullable.  If a symbol is not nulling, it is non-
       nulling.  In any instance where a symbol produces the empty string, it is said to be
       nulled, or to be a null symbol.

   Useless rules
       If any derivation from the start symbol uses a rule, that rule is called reachable or
       accessible.  A rule that is not accessible is called unreachable or inaccessible.  If any
       derivation which results in a sentence uses a rule, that rule is said to be productive.  A
       rule that is not productive is called unproductive.  For example, a rule is unproductive
       unless every symbol on its RHS either is a terminal or is the LHS of some other rule.  A
       rule which is inaccessible or unproductive is called a useless rule.  Marpa can handle
       grammars with useless rules.

       A symbol is reachable or accessible if it appears in a reachable production.  If a symbol
       is not reachable, it is unreachable or inaccessible.  A symbol is productive if it appears
       on the LHS of a productive rule, or if it is a nullable symbol.  If a symbol is not
       productive, it is unproductive.  A symbol which is inaccessible or unproductive is called
       a useless symbol.  Marpa can handle grammars with useless symbols.

   Recursion and cycles
       If any symbol in the grammar non-trivially produces a symbol string containing itself, the
       grammar is said to be recursive.  If any symbol non-trivially produces a symbol string
       with itself on the left, the grammar is said to be left-recursive.  If any symbol non-
       trivially produces a symbol string with itself on the right, the grammar is said to be
       right-recursive.  Marpa can handle all recursive grammars, including grammars which are
       left-recursive, grammars which are right-recursive, and grammars which contain both left-
       and right-recursion.

       A cycle is a non-trivial derivation of a string of symbols from itself.  If it is not
       possible for any derivation using a grammar to contain a cycle, then that grammar is said
       to be cycle-free.  Traditionally, a grammar is considered useless if it is not cycle-free.

       The traditional deprecation of cycles is well-founded.  A cycle is the parsing equivalent
       of an infinite loop.  Once a cycle appears, it can be repeated over and over again.  Even
       a very short input sentence can have an infinite number of parses when the grammar is not
       cycle-free.

       For that reason, a grammar which contains a cycle is also called infinitely ambiguous.
       Marpa can parse with grammars which are not cycle-free, and will even parse inputs that
       cause cycles.  When a parse is infinitely ambiguous, Marpa limits cycles to a single loop,
       so that only a finite number of parses is returned.

   Parse structure
       The structure of a parse can be represented as a series of derivation steps from the start
       symbol to the input.  Another way to represent structure is as a parse tree.  Every symbol
       used in the parse is represented by a node of the parse tree.  Wherever a production is
       used in the parse, its LHS is represented by a parent node and the RHS symbols are
       represented by child nodes.  The start symbol is the root of the tree.  The node at the
       root of the tree is called the start node.

       Traditionally, grammars divide all symbols sharply into terminals and non-terminals.  A
       terminal symbol must ALWAYS be used as a terminal.  A non-terminal symbol can NEVER be
       used as a terminal.

       Marpa's use of terminals is non-traditional, and its terminology is different accordingly.
       As in the traditional approach, Marpa's non-terminals can never be used as terminals.  But
       Marpa terminals can be used anywhere, even in places where the traditional approach
       requires a a non-terminal symbol.  In particular, a Marpa terminal can be the LHS of a
       rule.

       Traditionally, and in Marpa as well, every node is either a inner node or a leaf node.  In
       Marpa, leaf nodes are of two kinds:

       •   Nodes for nulled symbols.  A node for a nulled symbol is called a nulled node.

       •   Nodes for symbols being used as terminals.

   Ambiguity
       Marpa allows ambiguous grammars.  Traditionally we say that a parse is ambiguous if, for a
       given grammar and a given input, more than one derivation tree is possible.  However,
       Marpa allows ambiguous input tokens, which the traditional definition does not take into
       account.  If Marpa used the traditional definition, all grammars would be ambiguous except
       those grammars which allowed only the null parse.

       It is easiest if the Marpa definition and the traditional definition were extensionally
       equivalent --- that is, if Marpa's set of ambiguous grammars was exactly the same as the
       set of traditionally ambiguous grammars.  This can be accomplished by using a slightly
       altered definition.  In the Marpa context, a grammar is ambiguous if and only if, for some
       UNAMBIGUOUS stream of input tokens, that grammar produces more than one parse tree.

   Semantics
       In real life, the structure of a parse is usually a means to an end.  Grammars usually
       have a semantics associated with them, and what the user actually wants is the value of
       the parse according to the semantics.

       The tree representation is especially useful when evaluating a parse.  In the traditional
       method of evaluating a parse tree, every node which represents a terminal symbol has a
       value associated with it on input.  Non-null inner nodes take their semantics from the
       production whose LHS they represent.  Nulled nodes are dealt with as special cases.

       The semantics for a production describe how to calculate the value of the node which
       represents the LHS (the parent node) from the values of zero or more of the nodes which
       represent the RHS symbols (child nodes).  Values are computed recursively, bottom-up.  The
       value of a parse is the value of its start symbol.

Copyright and License

         Copyright 2014 Jeffrey Kegler
         This file is part of Marpa::R2.  Marpa::R2 is free software: you can
         redistribute it and/or modify it under the terms of the GNU Lesser
         General Public License as published by the Free Software Foundation,
         either version 3 of the License, or (at your option) any later version.

         Marpa::R2 is distributed in the hope that it will be useful,
         but WITHOUT ANY WARRANTY; without even the implied warranty of
         MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
         Lesser General Public License for more details.

         You should have received a copy of the GNU Lesser
         General Public License along with Marpa::R2.  If not, see
         http://www.gnu.org/licenses/.