Ubuntu Manpage: Marpa::R2::Scanless

Provided by: libmarpa-r2-perl_2.086000~dfsg-5build1_amd64

Name

       Marpa::R2::Scanless - Scanless interface

Synopsis

           use Marpa::R2;

           my $grammar = Marpa::R2::Scanless::G->new(
               {   bless_package => 'My_Nodes',
                   source        => \(<<'END_OF_SOURCE'),
           :default ::= action => [values] bless => ::lhs
           lexeme default = action => [ start, length, value ]
               bless => ::name latm => 1

           :start ::= Script
           Script ::= Expression+ separator => comma
           comma ~ [,]
           Expression ::=
               Number bless => primary
               | '(' Expression ')' bless => paren assoc => group
              || Expression '**' Expression bless => exponentiate assoc => right
              || Expression '*' Expression bless => multiply
               | Expression '/' Expression bless => divide
              || Expression '+' Expression bless => add
               | Expression '-' Expression bless => subtract

           Number ~ [\d]+
           :discard ~ whitespace
           whitespace ~ [\s]+
           # allow comments
           :discard ~ <hash comment>
           <hash comment> ~ <terminated hash comment> | <unterminated
              final hash comment>
           <terminated hash comment> ~ '#' <hash comment body> <vertical space char>
           <unterminated final hash comment> ~ '#' <hash comment body>
           <hash comment body> ~ <hash comment char>*
           <vertical space char> ~ [\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}]
           <hash comment char> ~ [^\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}]
           END_OF_SOURCE
               }
           );

           my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );

           my $input = '42*2+7/3, 42*(2+7)/3, 2**7-3, 2**(7-3)';
           $recce->read(\$input);
           my $value_ref = $recce->value();
           die "No parse was found\n" if not defined $value_ref;

           # Result will be something like "86.33... 126 125 16"
           # depending on the floating point precision
           my $result = ${$value_ref}->doit();

           package My_Nodes;

           sub My_Nodes::primary::doit { return $_[0]->[0]->doit() }
           sub My_Nodes::Number::doit  { return $_[0]->[2] }
           sub My_Nodes::paren::doit   { my ($self) = @_; $self->[1]->doit() }

           sub My_Nodes::add::doit {
               my ($self) = @_;
               $self->[0]->doit() + $self->[2]->doit();
           }

           sub My_Nodes::subtract::doit {
               my ($self) = @_;
               $self->[0]->doit() - $self->[2]->doit();
           }

           sub My_Nodes::multiply::doit {
               my ($self) = @_;
               $self->[0]->doit() * $self->[2]->doit();
           }

           sub My_Nodes::divide::doit {
               my ($self) = @_;
               $self->[0]->doit() / $self->[2]->doit();
           }

           sub My_Nodes::exponentiate::doit {
               my ($self) = @_;
               $self->[0]->doit()**$self->[2]->doit();
           }

           sub My_Nodes::Script::doit {
               my ($self) = @_;
               return join q{ }, map { $_->doit() } @{$self};
           }

About this document

       This document is an introduction and overview to Marpa's Scanless interface (SLIF).
       Marpa::R2's top-level page has an SLIF tutorial in Marpa's top-level page.  If you are new
       to Marpa or its SLIF, you probably want to start with that.

       This document follows up on the tutorial, looking more deeply and carefully at the
       concepts behind the SLIF.  Separate documents provide the reference documentation for
       Scanless grammar objects, Scanless recognizer objects and the Scanless DSL.

The two levels of language description

Programmers usually describe the syntax of a language at two levels. The same two-level
approach can be convenient for implementing a parser of the language. But, implementation
aside, a two-level description seems to be a natural approach to the design issues that
arise in languages intended for practical use.

The first level is structural. For example, here is how the Perl docs describe one of the
forms that Perl's "use" statement takes:

use Module VERSION LIST

and in Perl's source code ("perly.y") something similar drives the parser.

The second level is lexical. For example, Perl's perlpodspec page has a number of
statements like this:

[...] you can distinguish URL-links from anything else
by the fact that they match m/\A\w+:[^:\s]\S*\z/.

The lexical level is character by character. The structural level is less well-defined,
but in practice it ignores most of the character-by-character issues, and it almost always
avoids dealing with whitespace.

For reasons that will become clear later, I will sometimes call the lexical level, L0, and
will sometimes call the structural level, G1. (For historic reasons, L0 is sometimes also
called G0.)

It is important to realize that the difference between L0 and G1 is one of level of
description and NOT one of precision or exactness. A structural description of Perl's
"use" statement, much like the one I showed above, is in Perl's source code ("perly.y"),
along with many other, similar, structural-level descriptions. These are used to generate
the production parser for Perl so, clearly, structural level descriptions are every bit as
much a precision instrument as regular expressions.

A very simple language

       In order to focus on very basic issues, I will use as an example, a very simple language
       with a very simple semantics.  The language consists of decimal digits and ASCII spaces.
       The semantics will treat it as a series of integers to be added.

       Here are three strings in that language

            8675311
            867 5311
            8 6 7 5 3 1 1

       According to our semantics, the three strings contain respectively, one, two and seven
       integers.  The values of the three strings are, according to our semantics, the sum of
       these integers: respectively, 8675311, 6178, and 31.

       It's sometimes said, in describing languages like the above, that "whitespace is ignored".
       From the purely structural point of view this can be, in one sense, true.  But from the
       lexical point of view it's clearly quite false.

       Combining the two levels of description, it is very hard to justify an assertion that
       "whitespace is ignored".  The three strings in the display above differ only in
       whitespace.  Clearly the placement of the whitespace makes a vast difference, and has a
       major effect on the structure of string, which in turn has a determining effect on its
       semantics.

Why the structural level?

As we've seen, the structural level ignores essential aspects of the language. It is
possible to describe a language using a single level of description. So why have a
structural (G1) level of description? Why not a "unified" instead of a "split"
description.

It turns out that, for most languages of practical size, particularly those that deploy
whitespace in a natural and intuitive way, a "unified" description rapidly becomes
unwriteable, and even more rapidly becomes unreadable. The reader should be able to
convince himself by taking the BNF from his favorite language standard and recasting it so
that every rule takes into account whitespace. As one example, consider declarations in
the C language.

unsigned int a;
unsigned*b;

In the first of the two lines above the whitespace is necessary. In the second of the two
lines whitespace would be allowed, but is not necessary. You cannot simply insist on
whitespace between all symbols, because whitespace is and should be optional between some
symbols and not between others. Where whitespace is optional, and where it should not be,
depends on which characters are adjacent to each other. This kind of character-level
information is not convenient to represent at the structural (G1) level.

It is certainly possible to write whitespace-aware BNF for the fragment of the C language
above. And it is certainly possible to extend it to include more and more of the
declaration syntax. But before you've extended the BNF very much, you will notice it is
becoming a lot harder to write. You will also notice that, as quickly as it is becoming
hard to write, it is even more quickly becoming "write-only" -- impossible to read. In
making your BNF whitespace-aware, you are more than doubling its size. And you are
burying what intuition sees as the structure of the language under a deep pile of special
cases.

Long before you finish, I expect you will realize that the "unified" approach is simply
not workable. The authors of the C language relegated lexical issues to their own brief
section, and ignored them in most of their language description. This was clearly the
only practical approach.

Interweaving the two levels

       The scanless interface interweaves the "split" and "unified" approaches and, I hope,
       preserves the best features of each.  Here is full syntax of the example whitespace-and-
       digit language, described using Marpa::R2's scanless interface:

           :start ::= <number sequence>
           <number sequence> ::= <number>+ action => add_sequence
           number ~ digit+
           digit ~ [0-9]
           :discard ~ whitespace
           whitespace ~ [\s]+

   A new operator
       In this example, three of the scanless interface's extensions to the Stuifzand interface
       are used.  First, the tilde (""~"") is used to separate LHS and RHS of rules at the
       lexical (L0) level.  Rules whose LHS and RHS are separated by the traditional BNF operator
       (""::="") are at the structural (G1) level.

       The programmer must decide when to use the ""~"" operator and when to use the ""::=""
       operator, but the choice will usually be easy: If you want Marpa to "do what I mean" with
       whitespace, you use the ""::="" operator.  If you want Marpa to do exactly what you say on
       a character-by-character basis, then you use the ""~"" operator.

   Character classes
       Perl character classes are now allowed on the RHS of prioritized and quantified rules.
       The example shows character classes only in L0 rules, but character classes can also be
       used in G1 rules.  When a character class is used in a G1 rule, it still must be
       implemented at the L0 level.  Marpa knows this and "does what you mean."

   Discard pseudo-rules
       A new type of rule is introduced: a "discard" pseudo-rule.  A discard pseudo-rule has a
       ":discard" pseudo-symbol on its LHS and one symbol name on its RHS.  It indicates that,
       when the RHS symbol is recognized, it should not be passed on as usual to the structural
       (G1) level.  Instead, the lexical (L0) level will simply "discard" what it has found.  In
       the example, whitespace is discarded.

Lexemes

       Tokens at the boundary between L0 and G1 have special significance.  The top-level
       undiscarded symbols in L0, which will be called "L0 lexemes", go on to become the
       terminals in G1.  G1's terminals are called "G1 lexemes".  To find the "L0 lexemes", Marpa
       looks for symbols which are on the LHS of a L0 rule, but not on the RHS of any L0 rule.
       To find the "G1 lexemes", Marpa looks for symbols on the RHS of at least one G1 rule, but
       not on the LHS of any G1 rule.

       L0 and G1 should agree on what is a lexeme and what is not.  If they do not, the
       programmer receives a fatal message which describes the problem and the symbols involved.
       So in practice I will usually simply refer to "lexemes".

Longest acceptable tokens match

       If you specify ""latm => 1"" as the default, which you almost always should, the L0
       grammar looks for tokens on a longest acceptable tokens match (LATM) basis.  Tokens which
       the structural grammar would reject are thrown away.  So are tokens in discard pseudo-
       rules.  The rest are passed on to the G1 grammar.

       Note that the match is longest TOKENS.  Several tokens may have the same length, so
       several tokens may be "longest".  When that happens, Marpa uses the full set of longest
       tokens in looking for possible parses.  For more about LATM and its alternative, LTM, see
       the detailed description of the "latm" adverb.

Semantics

       The value of a L0 rule is always the string it matches, and the value of a lexeme from the
       G1 point of view is the same as its value from the L0 point of view.  This means that it
       makes no sense to specify semantic actions for L0 rules, and that is not allowed.

       With the exception of lexeme values, the semantics of the G1 grammar are exactly the same
       as for ordinary grammars.  Actions may be specified for G1 rules and will behave as
       described in Marpa::R2::Semantics.

Implementation

       The scannerless interface uses two co-operating Marpa grammars, an approach pioneered by
       Andrew Rodland.  There are separate Marpa grammars for the L0 and G1 levels, as well as
       separate parsers.  The details of their interaction are hidden from the user.  Typically,
       the L0 parser finds tokens and passes them up to the G1 parser.

       The interface described in this document is surprisingly implementation-agnostic.  The
       author developed the basics of this interface while trying an implementation approach,
       that used a single Marpa grammar, before changing to the dual grammar implementation.

Copyright and License

         Copyright 2014 Jeffrey Kegler
         This file is part of Marpa::R2.  Marpa::R2 is free software: you can
         redistribute it and/or modify it under the terms of the GNU Lesser
         General Public License as published by the Free Software Foundation,
         either version 3 of the License, or (at your option) any later version.

         Marpa::R2 is distributed in the hope that it will be useful,
         but WITHOUT ANY WARRANTY; without even the implied warranty of
         MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
         Lesser General Public License for more details.

         You should have received a copy of the GNU Lesser
         General Public License along with Marpa::R2.  If not, see
         http://www.gnu.org/licenses/.