Provided by: libmarpa-r2-perl_2.086000~dfsg-8build1_amd64 bug

NAME

       Marpa::R2::Advanced::Models - Other input models

About this document

       The alternative input models described in this document are an advanced technique.  If you
       are starting out with Marpa, you probably want to ignore this document.  If you are an
       experienced Marpa user, it is still safe to ignore this document, but you might find the
       possibilities it discusses interesting.

Marpa has two different ideas of location

       In the other Marpa documentation, we have spoken of "location", and assumed the standard
       input model.  The locations actually in use by the methods described for the standard
       input model were Earley set ordinals.  (An Earley set's ordinal is also its ID.)

       Marpa actually has two different ideas of location -- Earley set ordinal and earleme.
       This is ignored in the other Marpa documents, and it can be, because they assume the
       standard input model.  Use of the standard input model guarantees that earleme and Earley
       set ordinal will always be exactly the same.

       This document introduces methods which make it possible (and in fact likely) that earleme
       and Earley set ordinal will differ.  From here on, the reader will need to pay careful
       attention to the distinction.

What is an alternative input model?

       An alternative input model is anything that is not the default, token-stream model.  More
       helpfully, Marpa allows variable-length tokens and ambiguous tokens, and an alternative
       input model is any input model which

       •   Allows a token whose length is not exactly 1, or

       •   Allows locations which have more than one token.

       To do either of these things, a user must use the recognizer's "alternative" method.  In
       other words, if an application is not directly using the recognizer's "alternative" method
       call, that application is not using an alternative input method.

       Many concepts, such as parsing location, parse exhaustion, and the end of parsing, are
       somewhat more complicated when alternative input models are involved.  These concepts were
       explained in the main document for the recognizer on the assumption that the default input
       model was in use.  This document revises those explanations as necessary to take into
       account the alternative input models.

Token streams

       Marpa's default input model is the traditional one -- a token stream.  Token streams are
       very standard in parsing applications -- so much so that most texts do not take the
       trouble of defining the term.  A token stream is input structured as a sequence of tokens,
       where each token occupies one location and every location has a token.  In the token
       stream model, all tokens are of the same length.

       Conventionally, all tokens are of length 1, and the token stream starts at location 0.
       Following this convention, the Nth token would start at location N-1 and end at location
       N.  For example, the first token would start at location 0 and end at location 1.  The
       second token would start at location 1 and end at location 2.

Earlemes

       For most parsers, position is location in a token stream.  To deal with variable-length
       and overlapping tokens, Marpa needs a more flexible idea of location.

       Marpa's tracks position in terms of earlemes.  Earlemes are named after Jay Earley, the
       inventor of the first algorithm in Marpa's lineage.  Every token has a start earleme and
       an end earleme.

       The token stream model may also be called the token-per-earleme model.  In the token
       stream model, token location and earleme location are exactly identical.  More formally,
       in the token stream model, if the token location is N, then the earleme location is also
       N.  If a user's application uses the token stream model, the user can ignore the existence
       of earlemes, and can think entirely in terms of tokens and their position in a token
       stream.  Because of this, the main Marpa documents often speak simply of the "location" in
       the parse.

The furthest earleme

       The furthest earleme is the last earleme at which a token ends.  In the default input
       model, the furthest earleme and the current earleme are always the same.  As a result, in
       the default input model, the furthest earleme is not an important concept.

       In alternative input models, tokens may be longer than 1 earleme, and the furthest earleme
       and the current earleme may be far apart.  This becomes an issue when parsing is finished.
       Alternative input models use the recognizer's "end_input" method to ensure that processing
       of input catches up to the furthest earleme.

The latest Earley set and latest earleme

       The latest earleme is the earleme location of the latest Earley set.  In the default input
       model, the latest earleme is always the same as the current earleme.

       In alternative input models, there may not be an Earley set at a given earleme location.
       When that is the case for the current earleme, then the latest Earley set is not at the
       current earleme, and the latest earleme and current earlemes are different.

Methods

   alternative()
           $recce->alternative( 'a', \42, 1 ) or return 'First alternative failed';

       The "alternative" method is the most general method for reading input, and is used in
       alternative input models.  It takes three arguments, only the first of which is required.

       The first two arguments are the token type and a reference to the token value.  If the
       reference to the token value is omitted, or is "undef", the value of the token will be a
       Perl "undef".

       The third argument to the "alternative" method is a token length.  If omitted, token
       length defaults to 1, which is the correct value for the token stream model.  Its value
       can be any integer greater than zero.  Marpa does not allow zero length tokens in any
       input model.

       Unlike the "read" method, the "alternative" method does not advance the current location
       (current earleme) on each call.  This allows the application to read several tokens at the
       same earleme.  This is how ambiguous input is created.  To advance the current earleme
       when input is read using the "alternative" method, the "complete_earleme" method must be
       called.

       On success, "alternative" returns a Perl true value.  If the token is rejected,
       "alternative" returns a Perl false.  On other failures, "alternative" throws an exception.

   current_earleme()
           $current_earleme = $recce->current_earleme();

       Returns the current parse location, also known as the current earleme.  Not often needed.

   earleme()
             my $origin_earleme = $recce->earleme($origin_earley_set_id);

       Given an Earley set ID as its argument, the "earleme()" recognizer method returns the
       corresponding earleme.  Every integer in the range from 0 to the ID of the latest Earley
       is a valid Earley set ID, and every valid Earley set ID corresponds to an earleme.  If the
       argument of "earleme()" is greater than the latest Earley set ID, "earleme()" returns Perl
       "undef".

       There is currently no method that translates from earleme to Earley set.  Earley set to
       earleme translation is a well-behaved one-to-one function in all input models -- for every
       Earley set there is a earleme, and every earleme is mapped to by at most one Earley set.
       Earleme to Earley set translation is far less well-behaved.  In many input models, it is a
       partial function -- there are some earlemes that are in the valid range of earlemes but do
       not map to any Earley set.

       Earleme to Earley set translation is often not needed.  When it is, it can be implemented
       at the application level, with the application taking advantage of what it knows about its
       choice of input model.

   earleme_complete()
           $recce->earleme_complete();

       Processes all tokens at the current earleme and advances the current earleme by 1.  If the
       earleme cannot be completed, an exception is thrown.  Otherwise, an "event" count is
       returned.  If zero is returned, the earleme was completed without event.  In this context,
       an "event" is one of a list of occurrences of special interest during the successful
       completion of an earleme, as described for the recognizer's "read" method.

       All tokens read using the "alternative" method start at one location -- the current
       earleme.  When reading input using the "alternative" method, "earleme_complete" is used to
       complete processing at the current earleme and move forward in the input stream.

       "earleme_complete" may be called even if the "alternative" method has been not called
       since the last call to "earleme_complete".  This will create an earleme with no tokens.
       In certain input models, such as the character-per-earleme model, this can be useful.

   end_input()
           $recce->end_input();

       Indicates that input is finished.  Calling "end_input" is not necessary or useful in the
       default input model, because in the default input model no token has a length greater than
       1.

       The "end_input" method takes no arguments.  The "end_input" method returns a Perl true
       value on success.  On failure, it throws an exception.

       In alternative input models, calling the "earleme_complete" method once input is finished
       does not ensure that all input has been processed.  The "earleme_complete" method
       completes the current earleme, but in alternative models, tokens may extend well past the
       current earleme.  The "end_input" method ensures that all input is processed.

       Calling "end_input" multiple times on the same recognizer object is harmless, but useless.
       The second and subsequent calls will return a Perl true, but will have no effect.

Alternative models and "read()"

       A recognizer can mix calls to its "read" method with calls to its "alternative" method.
       The "read" method has the same effect as a single call to the "alternative" method,
       followed immediately by a call of the "earleme_complete" method.

Ambiguous lexing

       Marpa allows ambiguous tokens.  Several Marpa tokens can start at a single parsing
       location.  Ambiguous tokens can be of various lengths.  Tokens can also overlap.

       Potentially ambiguous lexing occurs when more than one token starts at a single earleme.
       When potentially ambiguous lexing occurs, it becomes possible for there to be more than
       one sequence of tokens.

       An actual lexical ambiguity only occurs if more than one of the potential token sequences
       is consistent with the grammar.  If there is no actual lexical ambiguity, Marpa will use
       the only token choice that is consistent with the grammar.

       When lexing is actually ambiguous, Marpa will use all the alternatives consistent with the
       grammar.  When the lexing in a parse is actually ambiguous, the parse will be ambiguous.
       From the point of view of Marpa's semantics, ambiguities caused by lexing look the same as
       ambiguities caused by an ambiguous grammar.

       In the standard terminology, if a grammar produces more than one parse tree for any input,
       then that grammar must be ambiguous.  In Marpa this is not strictly true.  In Marpa, if
       the input is ambiguous, even an unambiguous grammar can produce more than one parse.

Duplicate tokens

       A duplicate token is a token of the same type and the same length as another that was read
       at the same earleme.  Duplicate tokens are impossible in the default, token-stream, model.
       This is because in the token-stream model only one token can be read at each earleme.

       In alternative models, more than one token may be read at an earleme, and duplicates are
       possible.  Marpa detects duplicate tokens and treats them as "hard errors" -- Marpa throws
       an exception when it sees a duplicate token.  Marpa's assumption is that duplicate tokens
       indicate an error at the application level.

       An application can retry input after a duplicate token, if it catches the exception.  In
       the future, if recovery from duplicate tokens is found to be a useful technique, Marpa may
       provide an option to change its behavior, so that a soft failure is returned when there is
       a duplicate token.

Earlemes: the details

       While scanning, Marpa keeps track of the current earleme.  Earlemes in a parse start at
       earleme 0 and increase numerically.  The earleme immediately following earleme 0 is
       earleme 1, the earleme immediately following earleme 1 is earleme 2, and so on.  The
       earleme immediately following earleme N is always earleme N+1.

       Distance in the earleme stream is what you would intuitively expect it to be.  The
       distance between earleme X and earleme Y is the absolute value of the difference between X
       and Y, |X-Y|.  The distance from earleme 3 to earleme 6, for example, is 3 earlemes.

       Whenever a token is given to Marpa to be scanned, it starts at the current earleme.  In
       addition to the type and value of the token, Marpa must be told the token's length in
       earlemes.  The length of a Marpa token must be greater than zero.

       This earleme length will become the distance from the start of the token to the end of the
       token.  If the length of the token is L, and the current earleme is C, the end of the
       token will be at earleme C+L.

The character-per-earleme model

       Many different models of the relationship between tokens and earlemes are possible, but
       two are particularly important.  One is the one-token-per-earleme model, which is the
       default, and which has already been described.  The other is the one-character-per-earleme
       model.

       In the one-character-per-earleme model, every character will be treated as being exactly
       one earleme in length.  If a token is more than one character in length, that token will
       span earlemes.  When the lexing is ambiguous, tokens may overlap.

       When a one-character-per-earleme model of input is used, there may be many earlemes at
       which no tokens start.  For example, in a straightforward character-per-earleme
       implementation of a grammar for a language that allows comments, no tokens will start at
       any earlemes which correspond to character locations inside a comment.

Other input models

       So far only the token-per-earleme and character-per-earleme models have seen any real use
       in Marpa programs.  But other models are certainly possible.  Using earlemes, you can
       structure your input in almost any way you like.

       There are only three restrictions:

       1.  Scanning always starts at earleme 0.

       2.  All tokens starting at earleme N must be scanned before any tokens starting at earleme
           N+1.  In other words, the tokens must be scanned in non-decreasing order by start
           earleme.

       3.  Every token must have a length, in earlemes, which is greater than zero.  In other
           words, token length can never be zero or negative.

Copyright and License

         Copyright 2014 Jeffrey Kegler
         This file is part of Marpa::R2.  Marpa::R2 is free software: you can
         redistribute it and/or modify it under the terms of the GNU Lesser
         General Public License as published by the Free Software Foundation,
         either version 3 of the License, or (at your option) any later version.

         Marpa::R2 is distributed in the hope that it will be useful,
         but WITHOUT ANY WARRANTY; without even the implied warranty of
         MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
         Lesser General Public License for more details.

         You should have received a copy of the GNU Lesser
         General Public License along with Marpa::R2.  If not, see
         http://www.gnu.org/licenses/.