Provided by: libmarpa-r2-perl_12.000000-1_amd64 

NAME
Marpa::R2::Advanced::Input_Models - Alternative input models
About this document
The alternative input models described in this document are an advanced technique. While you are
starting out with Marpa, you probably want to ignore this document. If you are an experienced Marpa
user, it is still safe to ignore this document, but you might find the possibilities it discusses
interesting.
This document describes the input models at a conceptual, rather than an implementation level. The
reader should be familiar with the recognizer's external scanning capability. A reader interested in
implementation is assumed to be familiar with the methods described in "Mutators for external scanning"
in Marpa::R2::Scanless::R,
Marpa has two different ideas of location
In the other Marpa documentation, we have spoken of "location", and assumed the standard input model.
The locations actually in use by the methods described for the standard input model were Earley set
ordinals. (An Earley set's ordinal is also its ID.)
Marpa actually has two different ideas of location -- Earley set ordinal and earleme. This is ignored in
the other Marpa documents, and it can be, because they assume the standard input model. Use of the
standard input model guarantees that earleme and Earley set ordinal will always be exactly the same.
This document introduces methods which make it possible (and in fact likely) that earleme and Earley set
ordinal will differ. From here on, the reader will need to pay careful attention to the distinction.
What is an alternative input model?
An alternative input model is anything that is not the default, token-stream model. More helpfully,
Marpa allows variable-length tokens and ambiguous tokens, and an alternative input model is any input
model which
• Allows a token whose length is not exactly 1, or
• Allows locations which have more than one token.
To do either of these things, a user must use external scanning and the recognizer'x external scanning
methods. In particular, if an application is not directly using the recognizer's "lexeme_read" or
"lexeme_alternative" method calls, that application is not using an alternative input method.
Many concepts, such as parsing location, parse exhaustion, and the end of parsing, are somewhat more
complicated when alternative input models are involved. These concepts were explained in the main
document for the recognizer on the assumption that the default input model was in use. This document
revises those explanations as necessary to take into account the alternative input models.
Token streams
Marpa's default input model is the traditional one -- a token stream. Token streams are very standard in
parsing applications -- so much so that most texts do not take the trouble of defining the term. A token
stream is input structured as a sequence of tokens, where each token occupies one location and every
location has a token. In the token stream model, all tokens are of the same length.
Conventionally, all tokens are of length 1, and the token stream starts at location 0. Following this
convention, the Nth token would start at location N-1 and end at location N. For example, the first
token would start at location 0 and end at location 1. The second token would start at location 1 and
end at location 2.
Earlemes
For most parsers, position is location in a token stream. To deal with variable-length and overlapping
tokens, Marpa needs a more flexible idea of location.
Marpa's tracks position in terms of earlemes. Earlemes are named after Jay Earley, the inventor of the
first algorithm in Marpa's lineage. Every token has a start earleme and an end earleme.
The token stream model may also be called the token-per-earleme model. In the token stream model, token
location and earleme location are exactly identical. More formally, in the token stream model, if the
token location is N, then the earleme location is also N. If a user's application uses the token stream
model, the user can ignore the existence of earlemes, and can think entirely in terms of tokens and their
position in a token stream. Because of this, the main Marpa documents often speak simply of the
"location" in the parse.
The furthest earleme
The furthest earleme is the last earleme at which a token ends. In the default input model, the furthest
earleme and the current earleme are always the same. As a result, in the default input model, the
furthest earleme is not an important concept.
In alternative input models, tokens may be longer than 1 earleme, and the furthest earleme and the
current earleme may be far apart. This becomes an issue when parsing is finished. Alternative input
models use one or more calls to the recognizer's lexeme_complete() method to ensure that processing of
input catches up to the furthest earleme.
The latest Earley set and latest earleme
The latest earleme is the earleme location of the latest Earley set. In the default input model, the
latest earleme is always the same as the current earleme.
In alternative input models, there may not be an Earley set at a given earleme location. When that is
the case for the current earleme, then the latest Earley set is not at the current earleme, and the
latest earleme and current earlemes are different.
Ambiguous lexing
Marpa allows ambiguous tokens. Several Marpa tokens can start at a single parsing location. Ambiguous
tokens can be of various lengths. Tokens can also overlap.
Potentially ambiguous lexing occurs when more than one token starts at a single earleme. When
potentially ambiguous lexing occurs, it becomes possible for there to be more than one sequence of
tokens.
An actual lexical ambiguity only occurs if more than one of the potential token sequences is consistent
with the grammar. If there is no actual lexical ambiguity, Marpa will use the only token choice that is
consistent with the grammar.
When lexing is actually ambiguous, Marpa will use all the alternatives consistent with the grammar. When
the lexing in a parse is actually ambiguous, the parse will be ambiguous. From the point of view of
Marpa's semantics, ambiguities caused by lexing look the same as ambiguities caused by an ambiguous
grammar.
In the standard terminology, if a grammar produces more than one parse tree for any input, then that
grammar must be ambiguous. In Marpa this is not strictly true. In Marpa, if the input is ambiguous,
even an unambiguous grammar can produce more than one parse.
Duplicate tokens
A duplicate token is a token of the same type and the same length as another that was read at the same
earleme. Duplicate tokens are impossible in the default, token-stream, model. This is because in the
token-stream model only one token can be read at each earleme.
In alternative models, more than one token may be read at an earleme, and duplicates are possible. Marpa
detects duplicate tokens and treats them as "hard errors" -- Marpa throws an exception when it sees a
duplicate token. Marpa's assumption is that duplicate tokens indicate an error at the application level.
An application can retry input after a duplicate token, if it catches the exception. In the future, if
recovery from duplicate tokens is found to be a useful technique, Marpa may provide an option to change
its behavior, so that a soft failure is returned when there is a duplicate token.
Earlemes: the details
While scanning, Marpa keeps track of the current earleme. Earlemes in a parse start at earleme 0 and
increase numerically. The earleme immediately following earleme 0 is earleme 1, the earleme immediately
following earleme 1 is earleme 2, and so on. The earleme immediately following earleme N is always
earleme N+1.
Distance in the earleme stream is what you would intuitively expect it to be. The distance between
earleme X and earleme Y is the absolute value of the difference between X and Y, |X-Y|. The distance
from earleme 3 to earleme 6, for example, is 3 earlemes.
Whenever a token is given to Marpa to be scanned, it starts at the current earleme. In addition to the
type and value of the token, Marpa must be told the token's length in earlemes. The length of a Marpa
token must be greater than zero.
This earleme length will become the distance from the start of the token to the end of the token. If the
length of the token is L, and the current earleme is C, the end of the token will be at earleme C+L.
The character-per-earleme model
Many different models of the relationship between tokens and earlemes are possible, but two are
particularly important. One is the one-token-per-earleme model, which is the default, and which has
already been described. The other is the one-character-per-earleme model.
In the one-character-per-earleme model, every character will be treated as being exactly one earleme in
length. If a token is more than one character in length, that token will span earlemes. When the lexing
is ambiguous, tokens may overlap.
When a one-character-per-earleme model of input is used, there may be many earlemes at which no tokens
start. For example, in a straightforward character-per-earleme implementation of a grammar for a
language that allows comments, no tokens will start at any earlemes which correspond to character
locations inside a comment.
Other input models
So far only the token-per-earleme and character-per-earleme models have seen any real use in Marpa
programs. But other models are certainly possible. Using earlemes, you can structure your input in
almost any way you like.
There are only three restrictions:
1. Scanning always starts at earleme 0.
2. All tokens starting at earleme N must be scanned before any tokens starting at earleme N+1. In other
words, the tokens must be scanned in non-decreasing order by start earleme.
3. Every token must have a length, in earlemes, which is greater than zero. In other words, token
length can never be zero or negative.
Copyright and License
Copyright 2022 Jeffrey Kegler
This file is part of Marpa::R2. Marpa::R2 is free software: you can
redistribute it and/or modify it under the terms of the GNU Lesser
General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.
Marpa::R2 is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser
General Public License along with Marpa::R2. If not, see
http://www.gnu.org/licenses/.
perl v5.40.1 2025-06-25 Marpa::R2::Advanced::Input_Models(3pm)