Provided by: libmarpa-r2-perl_2.086000~dfsg-5build1_amd64 bug

NAME

       Marpa::R2::NAIF::Progress - Progress reports for the NAIF

About this document

       This document describes Marpa's progress reports for its named argument interface (NAIF).
       If you are a beginner, or are not sure which interface you are interested in, or do not
       know what the NAIF interfaces is, you probably are looking for the document on progress
       reports for the SLIF interface.

       Progress reports allow an application to know exactly where it is in the parse at any
       point.  For parse locations of the user's choosing, progress reports list all the rules in
       play, and indicate the location at which the rule started, and how far into the rule
       parsing has progressed.

       For those new to parsing, this "situational awareness" might seem to be a feature that
       they can expect from any servicable parser.  In fact, situation awareness in most parsers
       is extremely limited, and it tends to be more limited in those parsers considered
       production quality.  Marpa is not just unusual in the amount of feedback it offers the
       application, it breaks new ground.

       Progress reports are extremely useful in debugging grammars.  Because debugging creates a
       clear "narrative", the detailed example in this document is a debugging situation.
       Readers specifically interested in debugging a grammar should read the document on tracing
       problems before reading this document.

Introduction to Earley parsing

       To read the "show_progress" output, it is important to have a basic idea of what Earley
       items are, and of what the information in them means.  Everything that the user needs to
       know is explained in this section.

   Dotted rules
       The idea behind Earley's algorithm is that you can parse by building a table of rules and
       where you are in those rules.  "Where" means two things: location in the rule relative to
       the rule's symbols, and location relative to the parse's input stream.

       Let's look at an example of a rule in a context-free grammar.  Here's the rule for
       assignment from the Perl distribution's "perly.y"

       "    termbinop -> term ASSIGNOP term"

       "ASSIGNOP" is "perly.y"'s internal name for the assignment operator.  In plain Perl terms,
       this is the ""="" character.

       In parsing this rule, we can be at any of four possible locations.  One location is at the
       beginning, before all of the symbols.  The other three locations are immediately after
       each of the rule's three symbols.

       Within a rule, position relative to the symbols of the rule is traditionally indicated
       with a dot.  In fact, the symbol-relative rule position is very often called the dot
       location.  Taken as a pair, a rule and a dot location are called a dotted rule.

       Here's our rule with a dot location indicated:

       "    termbinop -> X term ASSIGNOP term"

       The dot location in this dotted rule is at the beginning.  A dot location at the beginning
       of a dotted rule means that we have not recognized any symbols in the rule yet.  All we
       are doing is predicting that the rule will occur.  A dotted rule with the dot before all
       of its symbols is called a prediction or a predicted rule.

       Here's another dotted rule:

       "    termbinop -> term X ASSIGNOP term"

       In this dotted rule, we are saying we have seen a "term", but have not yet recognized an
       "ASSIGNOP".

       There's another special kind of dotted rule, a completion.  A completion (also called a
       completed rule) is a dotted rule with the dot after all of the symbols.  Here is the
       completion for the rule that we have been using as an example:

       "    termbinop -> term ASSIGNOP term X"

       A completion indicates that a rule has been fully recognized.

   Earley items
       The dotted rules contain all but one piece of the information that Earley's algorithm
       needs to track.  The missing piece is the second of the two "wheres": where in the input
       stream.  To associate input stream location and dotted rules, Earley's algorithm uses what
       are now called Earley items.

       A convenient way to think of an Earley item is as a triple, or 3-tuple, consisting of
       dotted rule, origin and current location.  The origin is the location in the input stream
       where the dotted rule starts.  The current location (also called the dot location) is the
       location in the input stream which corresponds to the dot position.

       Two noteworthy consequences follow from the way in which origin and current location are
       defined.  First, if a dotted rule is a prediction, then origin and current location will
       always be the same.  Second, the input stream location where a rule ends is not tracked
       unless the dotted rule is a completion.  In other cases, an Earley item does not tell us
       if a rule will ever be completed, much less at which location.

The example

       For this example of debugging, I've taken a very common example of a grammar and
       deliberately introduced a problem.  (All the code and the full trace outputs for this
       example are in the Appendix.)  I've commented out the correct start rule:

               ## { lhs => 'Expression', rhs => [qw/Term/] },

       and replaced it with another start rule, one which will cause problems:

               { lhs => 'Expression', rhs => [qw/Factor/] },

       In what follows, we'll pretend we don't already know where the problem is, and use the
       Marpa diagnostics and tracing facilities to "discover" it.

First warning

       Right off the bat, we get two warning messages:

           Inaccessible symbol: Add
           Inaccessible symbol: Term

       If we were alert, these would be enough to tell us there is a serious problem.
       "Inaccessible" symbols are symbols which cannot be reached from the start symbol.  This
       means that the grammar will never produce them, and that parses will never find them in
       the input.

       Since "Add" and "Term" are both important symbols in our application, that should tell us
       our grammar has a serious problem.  In fact, these warning messages would often be enough
       to point us to the error.  But, in order to look at more of Marpa's tracing facilities,
       let's pretend we have not had our morning coffee, and that we miss the significance of
       these warning messages.

Output from trace_terminals()
       Before looking at Marpa's progress reports, it is usually best to orient yourself by
       looking at the output from "trace_terminals".  Typically, you will be interested in the
       last tokens to be accepted, and perhaps also in the tokens the recognizer was looking for
       when it didn't find what it wanted.  Sometimes that information alone is enough to make it
       clear where the problem is.

       The full "trace_terminals" output for this example is in the Appendix.  We see that the
       recognizer seems to accept ""42*1"" but it fails when it confronts the plus sign (""+"").
       The last two lines are:

           Accepted "Number" at 2-3
           Expecting "Multiply" at 3

       A note in passing: Marpa shows the location of the tokens it accepts as a range of
       locations.  For "Number", the range is ""2-3"", indicating that the token starts at
       location 2 and ends at location 3.  In its default input model, all tokens have length 1,
       so this is clearly information overkill.  But Marpa allows other input models, and in
       those models the information about start and end location of the token is important.

       Returning to the problem at hand: We notice that at location 3 we are expecting a
       "Multiply" operator, but not an "Add" operator.  That should strike us as strange, and
       send us back to the grammar.  But for the sake of our example we will assume that we are
       slow on the uptake today, and that this does not clue us in.  We move on.

Output from show_progress()
       Marpa's most powerful tool for debugging grammars is its progress report, which shows the
       Earley items being worked on.  In the Appendix, progress reports for the entire parse are
       shown.  Our example in this document is a very small one, so that producing progress
       reports for the entire parse is a reasonable thing to do in this case.  If a parse is at
       all large, you will usually need to be selective.

       The progress report that is usually of most interest is the one for the Earley set that
       you were working on when the error occurred.  This is called the current location.  In our
       example the current location is location 3.  By default, "show_progress" prints out only
       the progress reports for the current location.

       Here are the progress reports for the current location, location 3, from our example.

             F0 @0-3 Expression -> Factor .
             F2 @2-3 Factor -> Number .
             R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
             F4 @0-3 Factor -> Factor Multiply Factor .

   Progress report lines
       The last field of each Progress Report line shows, in fully expanded form, the dotted rule
       we were working on.  Since that is the most important information, it may be tempting to
       skip the rest of this section, and move directly forward with the debugging.

       In fact, you might want to do exactly that -- skip to the beginning of the next section.
       What follows talks about the details of the format of the first few fields in each
       progress report line.  These first few fields, while helpful, are also usually one or more
       of obvious in their meaning, not relevant to our example, and repetitive of information
       which can be deduced from other fields.

             F0 @0-3 Expression -> Factor .

       Prefixed to the dotted rule are two fields: ""F0 @0-3"".  The ""F0"" says that this is a
       completed or final rule, and that it is rule number 0.  The rule number is used in other
       tracing and debugging output, when displaying the whole rule would take too much space.
       In what follows we won't need the rule number.

       The ""@0-3"" describes the location of the dotted rule in the parse.  In its simplest
       form, the location field is two location numbers, separated by a hyphen.  The first
       location number is the origin, the place where Marpa first started recognizing the rule.
       The last location number is the dot location, the location location of the dot in a dotted
       rule.  ""@0-3"" say that this rule began at location 0, and that the dot is at location 3.

       The current location is location 3, and this is no coincidence.  Whenever we are
       displaying the progress report for an location, all the progress report lines will have
       their dot location at that location.

       As an aside, notice that the left hand side symbol is "Expression".  That is the start
       symbol.  The presence of a completed start rule in our progress report indicates that if
       our input ended at location 3, it would be a valid sentence in the language of our
       grammar.

       Let's look at another progress report line:

             R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor

       Here the ""R4:1"" indicates that this is rule number 4 (the ""R"" stands for rule number)
       and that its dot position is after the first symbol on the right hand side.  Symbol
       positions are numbered using the ordinal of the symbol just before the position.  Symbols
       are numbered starting with 1, and symbol position 1 is the position immediately after
       symbol 1.

       The next field (""x2"") is new.  It is a count.  A progress report can contain multiple
       instances of the same dotted rule, and when there is more than one, a count field is
       included in the progress report line.  Here the ""x2"" indicates that there are two
       instances of "Factor -> Factor . Multiply Factor" at this location.

       Multiple instances of a dotted rule will differ in their origin, and where they do, this
       is shown in the location field of the progress report line.  Here the location field is
       ""@0,2-3"", which indicates that one instance of this dotted rule has its origin at
       location 0, and the other has its origin at location 2.  All instances reported on a
       single progress report line will always have the same dot location, and in this case it is
       location 3.

       Predicted rules also appear in progress reports:

           P2 @2-2 Factor -> . Number

       Here the ""P"" in the summary field means "predicted".  As with much of the information in
       the summary field, this only repeats what is obvious from the full expansion of the dotted
       rule later in the line.  But final (or completed) and predicted rules can be important and
       the initial "F" and "P" make these lines easy to spot.

       Notice that in the predicted rule, the origin is the same as the dot location.  This will
       always be the case with predicted rules.

       For any given location, no predicted rule has more than one instance.  For other dotted
       rules, there may be many instances of the dotted rule at a single location.  In grammars
       with right recursion, the number of instances is limited only by the length of the
       recursion.  The length of a recursion is limited primarily by the available memory.

       When there are many instances of a dotted rule at a single location, it is inconvenient to
       show all the origins in a comma-separated list.  In that case the origins in the location
       field are shown as a range, with the earliest separated from the most recent by a ""..."".
       The example in this document contains no lines with a large number of instances, but here
       is an example from another grammar.  This is the progress report line for the completed
       rule in a right recursion of length 20.

           F1 x20 @0...19-20 Top_sequence -> Top Top_sequence .

   OK!  Now to find the bug
       Here again are progress reports at the location where things went wrong:

             F0 @0-3 Expression -> Factor .
             F2 @2-3 Factor -> Number .
             R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
             F4 @0-3 Factor -> Factor Multiply Factor .

       We see that we have completed rules for "Expression", and "Factor", as expected.  We also
       see two Earley items that show that we are in the process of building another "Factor",
       and that it is expecting a "Multiply" symbol.  This is not the rule we want, but it
       explains why the "trace_terminals" output showed that the recognizer was expecting a
       "Multiply" symbol.

       What we want to know is, why is the recognizer not expecting an "Add" symbol?  Looking
       back at the grammar, we see that only one rule uses the "Add" symbol: the rule ""Term ->
       Term Add Term"".  The next step is to look at the Earley items for this rule.  But there
       is a problem.  We don't find any.

       Next, we ask ourselves, what is the earliest place the ""Term -> Term Add Term"" rule
       should be appearing?  The answer is that there should be a prediction of ""Term -> Term
       Add Term"" at location 0.  So we look at the predictions at location 0.

             P0 @0-0 Expression -> . Factor
             P2 @0-0 Factor -> . Number
             P4 @0-0 Factor -> . Factor Multiply Factor

       No ""Term -> Term Add Term"" rule.  We are never even predicting a ""Term -> Term Add
       Term"" rule.  We look back at the grammar, and start from the beginning.

           { lhs     => 'Expression', rhs => [qw/Factor/] },
           { lhs => 'Term',       rhs => [qw/Factor/] },
           { lhs => 'Factor',     rhs => [qw/Number/] },
           {   lhs    => 'Term',
               rhs    => [qw/Term Add Term/],
               action => 'do_add'
           },
           {   lhs    => 'Factor',
               rhs    => [qw/Factor Multiply Factor/],
               action => 'do_multiply'
           },

       The start symbol is "Expression" and we do see a rule with "Expression" on the left hand
       side.  "Expression" in turn produces a "Factor" symbol, and there are two rules with
       "Factor" on the left hand side.

       But none of these rules ever produce a "Term".  In fact, however far we follow the
       productions, no rule ever produces a "Term".  At this point we see the problem: If we
       start at the start symbol, and follow the rules of our grammar, we will never get to a
       "Term" symbol.  Which is exactly what that first warning message was saying.

       Now that we know what is wrong, we can reread our grammar, and see that our "Expression ->
       Factor" rule is wrong.  It should be "Expression -> Term".  Change that and the problem is
       fixed.

Empty rules

       When a symbol is nulled in your parse, "show_progress" does not show its expansion into
       rules.  This reduces clutter, and seems to be the intuitive way to treat nulled rules.

Access to the "raw" progress report information

       This section deals with the "progress()" recognizer method, which allows access to the raw
       progress report information.  This method is not needed for typical debugging and tracing
       situations.  It is intended for applications which want to leverage Marpa's "situational
       awareness" in innovative ways.

   progress()
           my $report0 = $recce->progress(0);

           my $latest_report = $recce->progress();

       Given the parse location (Earley set ID) as its argument, the "progress()" recognizer
       method returns a reference to an array of "report items".  The parse location may be
       negative.  An argument of -X will be interpreted as location N+X+1, where N is the latest
       Earley set.  This means that an argument of -1 indicates the latest Earley set, an
       argument of -2 indicates the Earley set just before the latest one, etc.

       Each report item is a triple: an array of three elements.  The three elements are, in
       order, rule ID, dot position, and origin.  The data returned by the two displays above, as
       well as the data for the other parse locations in our example, are shown below.

       The rule ID is the same number that Marpa uses to identify rules in tracing and debugging
       output.  Given a rule ID, an application can find its LHS and RHS symbols using the
       grammar's "rule()" method.

       Dot position is -1 for completions, and 0 for predictions.  Where the report item is not
       for a completion or a prediction, dot position is N, where N is the number of RHS symbols
       successfully recognized at the parse location of the progress report.

       Origin is parse location (Earley set ID) at which the rule application reported by the
       report item began.  For a prediction, origin will always be the same as the location of
       the parse report.

   Progress reports and efficiency
       When progress reports are used for production parsing, instead of just for debugging and
       tracing, efficiency considerations become significant.  We will start with the good news.
       "progress()" is implemented in C, so that the application can usually expect calls to
       "progress()" to be extremely fast.

       Now, as to the potential bad news: Applications planning frequent use of calls to
       "progress()" need to be aware that, where there is a right recursion at a parse location,
       "progress()" needs to follow the entire chain of recursions to create the progress report.
       That this is happening will always be evident from the parse report itself, which will
       contain one report item for every completion in the right-recursive chain of completions.
       If an application tries to produce progress reports for a large number of parse locations,
       and these parse locations have long right recursive chains, it can prove computationally
       expensive.

       The efficiency consideration just mentioned for following right recursions is not an issue
       for left recursions.  Left recursions only produce at most two report items per parse
       location and are extremely fast to process.  It is also not an issue for Marpa's sequence
       rules, because sequence rules are implemented internally as left recursions.

Appendix: full code and output for the example

       Below are the code, the trace outputs and the progress report for the example used in this
       document.

   Code
           my $grammar = Marpa::R2::Grammar->new(
               {   start          => 'Expression',
                   actions        => 'My_Actions',
                   default_action => 'first_arg',
                   rules          => [
                       ## This is a deliberate error in the grammar
                       ## The next line should be:
                       ## { lhs => 'Expression', rhs => [qw/Term/] },
                       ## I have changed the Term to 'Factor' which
                       ## will cause problems.
                       { lhs => 'Expression', rhs => [qw/Factor/] },
                       { lhs => 'Term',       rhs => [qw/Factor/] },
                       { lhs => 'Factor',     rhs => [qw/Number/] },
                       {   lhs    => 'Term',
                           rhs    => [qw/Term Add Term/],
                           action => 'do_add'
                       },
                       {   lhs    => 'Factor',
                           rhs    => [qw/Factor Multiply Factor/],
                           action => 'do_multiply'
                       },
                   ],
               }
           );

           $grammar->precompute();

           my @tokens = (
               [ 'Number',   42 ],
               [ 'Multiply', q{*} ],
               [ 'Number',   1 ],
               [ 'Add',      q{+} ],
               [ 'Number',   7 ],
           );

           sub My_Actions::do_add {
               my ( undef, $t1, undef, $t2 ) = @_;
               return $t1 + $t2;
           }

           sub My_Actions::do_multiply {
               my ( undef, $t1, undef, $t2 ) = @_;
               return $t1 * $t2;
           }

           sub My_Actions::first_arg { shift; return shift; }

           my $recce = Marpa::R2::Recognizer->new(
               { grammar => $grammar, trace_terminals => 2 } );

           my $token_ix = 0;

           TOKEN: for my $token_and_value (@tokens) {
               last TOKEN if not defined $recce->read( @{$token_and_value} );
           }

           $progress_report = $recce->show_progress( 0, -1 );

   Trace output
           Inaccessible symbol: Add
           Inaccessible symbol: Term
           Setting trace_terminals option
           Expecting "Number" at earleme 0
           Accepted "Number" at 0-1
           Expecting "Multiply" at 1
           Accepted "Multiply" at 1-2
           Expecting "Number" at 2
           Accepted "Number" at 2-3
           Expecting "Multiply" at 3
           Rejected "Add" at 3-4

       Note the use of the term "earleme".  If you are using the default input model, you can
       assume that earleme means "location": the earleme and the location will always be exactly
       the same.  Advanced users, using alternative input models, may set it up so that earleme
       and location are two different things, and in that case the distinction will matter.

   show_progress() output
           P0 @0-0 Expression -> . Factor
           P2 @0-0 Factor -> . Number
           P4 @0-0 Factor -> . Factor Multiply Factor
           F0 @0-1 Expression -> Factor .
           F2 @0-1 Factor -> Number .
           R4:1 @0-1 Factor -> Factor . Multiply Factor
           P2 @2-2 Factor -> . Number
           P4 @2-2 Factor -> . Factor Multiply Factor
           R4:2 @0-2 Factor -> Factor Multiply . Factor
           F0 @0-3 Expression -> Factor .
           F2 @2-3 Factor -> Number .
           R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
           F4 @0-3 Factor -> Factor Multiply Factor .

   progress() outputs
       These section contains the output of the "progress()" method -- the progress reports in
       their "raw" format.  The output is shown in Data::Dumper format, with
       "Data::Dumper::Indent" set to 0 and "Data::Dumper::Terse" set to 1.

       The "Data::Dumper" output from "progress()" at location 0:

           [[0,0,0],[2,0,0],[4,0,0]]

       The "Data::Dumper" output from "progress()" at location 1:

           [[0,-1,0],[2,-1,0],[4,1,0]]

       The "Data::Dumper" output from "progress()" at location 2:

           [[2,0,2],[4,0,2],[4,2,0]]

       The default "progress()" output is for the latest Earley set, which is location 3 in our
       example.  Here is the "progress()" output for location 3.

           [[0,-1,0],[2,-1,2],[4,-1,0],[4,1,0],[4,1,2]]

Copyright and License

         Copyright 2014 Jeffrey Kegler
         This file is part of Marpa::R2.  Marpa::R2 is free software: you can
         redistribute it and/or modify it under the terms of the GNU Lesser
         General Public License as published by the Free Software Foundation,
         either version 3 of the License, or (at your option) any later version.

         Marpa::R2 is distributed in the hope that it will be useful,
         but WITHOUT ANY WARRANTY; without even the implied warranty of
         MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
         Lesser General Public License for more details.

         You should have received a copy of the GNU Lesser
         General Public License along with Marpa::R2.  If not, see
         http://www.gnu.org/licenses/.