Ubuntu Manpage: Marpa::R2::NAIF::Recognizer

Provided by: libmarpa-r2-perl_2.086000~dfsg-6build2_amd64

NAME

       Marpa::R2::NAIF::Recognizer - NAIF recognizers

Synopsis

           my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );
           $recce->read( 'Number', 42 );
           $recce->read('Multiply');
           $recce->read( 'Number', 1 );
           $recce->read('Add');
           $recce->read( 'Number', 7 );

Description

       This document describes recognizers for Marpa's named argument interface (NAIF).  If you
       are a beginner, or are not sure which interface you are interested in, or do not know what
       the NAIF interface is, you probably are looking for the document on recognizers for the
       SLIF interface.

       To create a recognizer object, use the "new" method.

       To read input, use the "read" method.

       To evaluate a parse tree, based on the input, use the "value" method.
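
       The three methods above can be combined into a complete run.  The following is a minimal
       sketch: the calculator grammar and the "My_Actions" package are assumptions made for
       illustration, since this document does not show the grammar behind its synopsis.

```perl
use strict;
use warnings;
use Marpa::R2;

# Assumed grammar: a small calculator in which Multiply binds tighter than Add.
my $grammar = Marpa::R2::Grammar->new(
    {   start          => 'Expression',
        actions        => 'My_Actions',
        default_action => 'first_arg',
        rules          => [
            { lhs => 'Expression', rhs => [qw/Term/] },
            { lhs => 'Term',       rhs => [qw/Factor/] },
            { lhs => 'Factor',     rhs => [qw/Number/] },
            { lhs => 'Term', rhs => [qw/Term Add Term/], action => 'do_add' },
            {   lhs    => 'Factor',
                rhs    => [qw/Factor Multiply Factor/],
                action => 'do_multiply'
            },
        ],
    }
);
$grammar->precompute();

# Semantic actions; the first argument is the per-parse variable, unused here.
sub My_Actions::do_add {
    my ( undef, $left, undef, $right ) = @_;
    return $left + $right;
}

sub My_Actions::do_multiply {
    my ( undef, $left, undef, $right ) = @_;
    return $left * $right;
}

sub My_Actions::first_arg {
    my ( undef, $first ) = @_;
    return $first;
}

# Create the recognizer, read the tokens, then evaluate.
my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );
$recce->read( 'Number', 42 );
$recce->read('Multiply');
$recce->read( 'Number', 1 );
$recce->read('Add');
$recce->read( 'Number', 7 );

my $value_ref = $recce->value;
my $value = $value_ref ? ${$value_ref} : 'No Parse';
print "$value\n";
```

       With this grammar, multiplication binds more tightly than addition, so the input is
       evaluated as (42 * 1) + 7.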

   Token streams
       By default, Marpa uses the token-stream model of input.  The token-stream model is
       standard -- so standard that most documents about parsing do not bother to describe it.  In
       the token-stream model, each read adds a token at the current location, then advances the
       current location by one.  The location before any input is numbered 0 and if N tokens are
       parsed, they fill the locations from 1 to N.
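
       The location numbering can be observed with the "latest_earley_set" accessor.  This is a
       sketch only; the "Sum" grammar is a hypothetical example and is not taken from this
       document.

```perl
use strict;
use warnings;
use Marpa::R2;

# Hypothetical grammar: one or more Numbers separated by Add.
my $grammar = Marpa::R2::Grammar->new(
    {   start => 'Sum',
        rules => [
            { lhs => 'Sum', rhs => [qw/Number/] },
            { lhs => 'Sum', rhs => [qw/Sum Add Number/] },
        ],
    }
);
$grammar->precompute();

my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );

# Record the location before any input, then after each token.
my @locations = ( $recce->latest_earley_set() );
for my $token ( [ 'Number', 1 ], ['Add'], [ 'Number', 2 ] ) {
    defined $recce->read( @{$token} )
        or die "Token rejected: $token->[0]";
    push @locations, $recce->latest_earley_set();
}

print "Locations: @locations\n";
```

       Each successful "read" should advance the location by exactly one, starting from 0.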

       This document will describe only the token-stream model of input.  Marpa allows other
       models of the input, but their use requires special method calls, which are described in
       the document on alternative input models.

Constructor

   new()
           my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );

       The "new" method creates a recognizer object.  The "new" method either returns a new
       recognizer object or throws an exception.

       The arguments to the "new" method are references to hashes of named arguments.  In each
       key/value pair of these hashes, the key is the argument name, and the hash value is the
       value of the argument.  The named arguments are described below.

Accessors

   check_terminal()
           my $is_symbol_a_terminal = $recce->check_terminal('Document');

       Returns a Perl true when its argument is the name of a terminal symbol.  Otherwise,
       returns a Perl false.  Not often needed.

   events()
           my @expected_symbols =
               map { $_->[1]; }
               grep { $_->[0] eq 'SYMBOL_EXPECTED' } @{ $recce->events() };

       Returns a reference to an array of the events from the last "read()" method call.  Each
       element of the array is a subarray of 1 or 2 elements.  The first element of the subarray
       is the name of an event type, as described in "Recognizer events".  The second element is
       the event value of the event, where that is applicable.  For more detail, see "Recognizer
       events".

   exhausted()
               $recce->exhausted() and die 'Recognizer exhausted';

       The "exhausted" method returns a Perl true if parsing in a recognizer is exhausted, and a
       Perl false otherwise.  Parsing is exhausted when the recognizer will not accept any
       further input.  By default, a recognizer event occurs if parsing is exhausted.  An attempt
       to read input into an exhausted parser causes an exception to be thrown.  The recognizer
       event and the exception are all that many applications require, but this method allows the
       recognizer's exhaustion status to be discovered directly.

   latest_earley_set()
           my $latest_earley_set = $recce->latest_earley_set();

       Returns the location of the latest (in other words, the most recent) Earley set.  In the
       places where it is most often needed, the latest Earley set is the default, and there is
       usually no need to request the explicit value of the latest Earley set.

   progress()
       Given the location (Earley set ID) as its argument, returns an array that describes the
       parse progress at that location.  Details on progress reports can be found in their own
       document.

   terminals_expected()
           my $terminals_expected = $recce->terminals_expected();

       Returns a reference to a list of strings, where the strings are the names of the terminals
       acceptable at the current location.  In the default input model, the presence of a
       terminal in this list means that terminal will be acceptable in the next "read" method
       call.  This is highly useful for Ruby Slippers parsing.
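
       As an illustration, the sketch below asks which terminals are acceptable before and after
       the first token.  The "Sum" grammar is a hypothetical example, and the results are sorted
       because no particular order of the returned names is assumed.

```perl
use strict;
use warnings;
use Marpa::R2;

# Hypothetical grammar: one or more Numbers separated by Add.
my $grammar = Marpa::R2::Grammar->new(
    {   start => 'Sum',
        rules => [
            { lhs => 'Sum', rhs => [qw/Number/] },
            { lhs => 'Sum', rhs => [qw/Sum Add Number/] },
        ],
    }
);
$grammar->precompute();

my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );

# At location 0 only a Number can start a Sum.
my @at_start = sort @{ $recce->terminals_expected() };

defined $recce->read( 'Number', 1 ) or die 'Number rejected';

# After a Number, the only way to continue the Sum is with Add.
my @after_number = sort @{ $recce->terminals_expected() };

print "At start: @at_start\n";
print "After Number: @after_number\n";
```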

Mutators

   expected_symbol_event_set()
           $recce->expected_symbol_event_set( 'endmark', 1 );

       Marpa can generate a recognizer event when a symbol is expected at the current earleme.
       This method takes a symbol name as its first argument, and turns the expected-symbol event
       for that symbol on or off, according to whether its second argument is 1 or 0.  Always
       succeeds or throws an exception.

       Events can occur at location 0 -- when the recognizer is first created.  However, the
       event setting of "expected_symbol_event_set()" cannot have an effect until after the first
       token is read -- after location 0.  In cases where this is an issue, the
       "event_if_expected" named argument of the "new()" method can be used to set an expected-
       symbol event.
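
       A minimal sketch of turning on an expected-symbol event after the recognizer is created,
       then inspecting the events of the first "read".  The "Sum" grammar is a hypothetical
       example.

```perl
use strict;
use warnings;
use Marpa::R2;

# Hypothetical grammar: one or more Numbers separated by Add.
my $grammar = Marpa::R2::Grammar->new(
    {   start => 'Sum',
        rules => [
            { lhs => 'Sum', rhs => [qw/Number/] },
            { lhs => 'Sum', rhs => [qw/Sum Add Number/] },
        ],
    }
);
$grammar->precompute();

my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );

# Ask for an event whenever Add is expected.  Set this way, the event
# can first be seen after the first token has been read.
$recce->expected_symbol_event_set( 'Add', 1 );

defined $recce->read( 'Number', 1 ) or die 'Number rejected';

# Collect the symbol names from the SYMBOL_EXPECTED events of that read.
my @expected_symbols =
    map { $_->[1] }
    grep { $_->[0] eq 'SYMBOL_EXPECTED' } @{ $recce->events() };

print "Expected after first token: @expected_symbols\n";
```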

   read()
           $recce->read( 'Number', 42 );
           $recce->read('Multiply');
           $recce->read( 'Number', 1 );
           $recce->read('Add');
           $recce->read( 'Number', 7 );

       The "read" method reads one token at the current parse location.  It then advances the
       current location by 1.

       "read" takes two arguments: a token name and a token value.  The token name is required.
       It must be the name of a valid terminal symbol.  The token value is optional.  It defaults
       to a Perl "undef".  For details about terminal symbols, see "Terminals" in
       Marpa::R2::NAIF::Grammar.

       The parser may accept or reject the token.  If the parser accepted the token, the "read"
       method returns the number of recognizer events that occurred during the "read".  For more
       about events, see "Recognizer events".

       Marpa may reject a token because it is not one of those acceptable at the current
       location.  When this happens, "read" returns a Perl "undef".  A rejected token need not
       end parsing -- it is perfectly possible to retry the "read" call with another token.  This
       is, in fact, an important technique in Ruby Slippers parsing.  For details, see the
       section on Ruby Slippers parsing.

       For other failures, including an attempt to "read" a token into an exhausted parser, Marpa
       throws an exception.
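
       Because an accepted token may produce zero events, the return value of "read" should be
       tested with "defined" rather than for truth.  The sketch below shows an accepted token, a
       rejected token and a retry; the "Sum" grammar is a hypothetical example.

```perl
use strict;
use warnings;
use Marpa::R2;

# Hypothetical grammar: one or more Numbers separated by Add.
my $grammar = Marpa::R2::Grammar->new(
    {   start => 'Sum',
        rules => [
            { lhs => 'Sum', rhs => [qw/Number/] },
            { lhs => 'Sum', rhs => [qw/Sum Add Number/] },
        ],
    }
);
$grammar->precompute();

my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );

# Accepted: the result is an event count, which may well be 0.
my $accepted = $recce->read( 'Number', 1 );

# Rejected: a second Number is not acceptable here, so read returns undef.
my $rejected = $recce->read( 'Number', 2 );

# The rejection did not end the parse; another token can be tried.
my $retried = $recce->read('Add');

printf "accepted: %s, rejected: %s, retried: %s\n",
    map { defined $_ ? 'defined' : 'undef' } $accepted, $rejected, $retried;
```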

   set()
           $recce->set( { max_parses => 10, } );

       The "set" method's arguments are references to hashes of named arguments.  The "set"
       method can be used to set or change named arguments after the recognizer has been created.
       Details of the named arguments are below.

   value()
           my $value_ref = $recce->value;
           my $value = $value_ref ? ${$value_ref} : 'No Parse';

       Because Marpa parses ambiguous grammars, every parse is a series of zero or more parse
       trees.  There are zero parse trees if there was no valid parse of the input according to
       the grammar.
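
       Calling "value" repeatedly walks through the parse series, one parse tree per call, until
       it returns "undef".  The sketch below uses a deliberately ambiguous subtraction grammar;
       the grammar and the "My_Actions" package are assumptions made for illustration.

```perl
use strict;
use warnings;
use Marpa::R2;

# Hypothetical ambiguous grammar: "5 - 3 - 1" can group left or right.
my $grammar = Marpa::R2::Grammar->new(
    {   start   => 'E',
        actions => 'My_Actions',
        rules   => [
            { lhs => 'E', rhs => [qw/E Minus E/], action => 'do_subtract' },
            { lhs => 'E', rhs => [qw/Number/],    action => 'first_arg' },
        ],
    }
);
$grammar->precompute();

sub My_Actions::do_subtract {
    my ( undef, $left, undef, $right ) = @_;
    return $left - $right;
}

sub My_Actions::first_arg {
    my ( undef, $first ) = @_;
    return $first;
}

my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );
for my $token ( [ 'Number', 5 ], ['Minus'], [ 'Number', 3 ], ['Minus'], [ 'Number', 1 ] ) {
    defined $recce->read( @{$token} )
        or die "Token rejected: $token->[0]";
}

# Collect one result per parse tree; value() returns undef when none remain.
my @values;
while ( defined( my $value_ref = $recce->value() ) ) {
    push @values, ${$value_ref};
}

# The order of the series is not assumed, so sort before printing.
@values = sort { $a <=> $b } @values;
print "Parse results: @values\n";
```

       The two groupings are (5 - 3) - 1 and 5 - (3 - 1), so two different results are expected.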

       The "value" method call evaluates the next parse tree in the parse series, and returns a
       reference to the parse result for that parse tree.  If there are no more parse trees, the
       "value" method returns "undef".

   reset_evaluation()
               $recce->reset_evaluation();
               $recce->set( { end => $loc, max_parses => 999, } );

       The "reset_evaluation()" method ends a parse series, and starts another.  It can be used
       to "restart" the parse series.  Restarting the parse series with the "reset_evaluation()"
       method allows the application to specify new values for the "closures", "end" and
       "ranking_method" named arguments.  Once a parse series is underway, these values cannot be
       changed.
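
       A sketch of evaluating one recognizer run at several end locations, restarting the parse
       series between evaluations.  The "Sum" grammar, the "My_Actions" package and the chosen
       end locations are assumptions made for illustration.

```perl
use strict;
use warnings;
use Marpa::R2;

# Hypothetical grammar: one or more Numbers separated by Add.
my $grammar = Marpa::R2::Grammar->new(
    {   start   => 'Sum',
        actions => 'My_Actions',
        rules   => [
            { lhs => 'Sum', rhs => [qw/Number/],         action => 'first_arg' },
            { lhs => 'Sum', rhs => [qw/Sum Add Number/], action => 'do_add' },
        ],
    }
);
$grammar->precompute();

sub My_Actions::do_add {
    my ( undef, $left, undef, $right ) = @_;
    return $left + $right;
}

sub My_Actions::first_arg {
    my ( undef, $first ) = @_;
    return $first;
}

my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );
for my $token ( [ 'Number', 1 ], ['Add'], [ 'Number', 2 ], ['Add'], [ 'Number', 3 ] ) {
    defined $recce->read( @{$token} )
        or die "Token rejected: $token->[0]";
}

# Evaluate the prefixes ending at locations 1, 3 and 5.
my %value_at;
for my $end ( 1, 3, 5 ) {
    $recce->set( { end => $end } );    # allowed: no evaluation is underway
    my $value_ref = $recce->value();
    $value_at{$end} = $value_ref ? ${$value_ref} : 'No Parse';
    $recce->reset_evaluation();        # end this series so "end" can change
}

print "end=$_: $value_at{$_}\n" for sort { $a <=> $b } keys %value_at;
```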

       The most common use for the "reset_evaluation()" method is to parse a single input stream at
       different end points.  This can also be done by creating a new recognizer and re-reading
       the input from the beginning, but it is much more efficient to evaluate a single
       recognizer run several times, using different parse end locations.  After the parse is
       restarted using the "reset_evaluation()" method, the recognizer's "set()" method and its
       "end" named argument can be used to change the parse end location.

Trace accessors

   show_progress()
           print $recce->show_progress()
               or die "print failed: $ERRNO";

       Returns a string describing the progress of the parse.  With no arguments, the string
       contains reports for the current location.  With a single integer argument N, the string
       contains reports for location N.  With two numeric arguments, N and M, the arguments are
       interpreted as a range of locations and the returned string contains reports for all
       locations in the range.  ("Location" as referred to in this section, and elsewhere in this
       document, is what is also called the Earley set ID.)

       If an argument is negative, -N, it indicates the Nth location counting backward from the
       furthest location of the parse.  For example, if 42 was the furthest location, -1 would be
       location 42 and -2 would be location 41.  The method call
       "$recce->show_progress(-3, -1)" returns reports for the last three locations of the parse.
       The method call "$recce->show_progress(0, -1)" returns progress reports for the entire
       parse.
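
       A short sketch of the calling conventions described above.  The "Sum" grammar is a
       hypothetical example, and the format of the reports is not shown here because it is
       described in the document on debugging.

```perl
use strict;
use warnings;
use Marpa::R2;

# Hypothetical grammar: one or more Numbers separated by Add.
my $grammar = Marpa::R2::Grammar->new(
    {   start => 'Sum',
        rules => [
            { lhs => 'Sum', rhs => [qw/Number/] },
            { lhs => 'Sum', rhs => [qw/Sum Add Number/] },
        ],
    }
);
$grammar->precompute();

my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );
for my $token ( [ 'Number', 1 ], ['Add'], [ 'Number', 2 ] ) {
    defined $recce->read( @{$token} )
        or die "Token rejected: $token->[0]";
}

# No arguments: reports for the current location only.
my $current_only = $recce->show_progress();

# A range: from location 0 to the furthest location (-1).
my $whole_parse = $recce->show_progress( 0, -1 );

print "Current location:\n$current_only\n";
print "Whole parse:\n$whole_parse\n";
```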

       "show_progress" is Marpa's most powerful tool for debugging application grammars.  It can
       also be used to track the progress of a parse or to investigate how a parse works.  A much
       fuller description, with an example, is in the document on debugging Marpa grammars.

Named arguments

       The recognizer's named arguments are accepted by its "new" and "set" methods.

   closures
       The value of the "closures" named argument must be a reference to a hash.  In each
       key/value pair of this hash, the key must be an action name.  The hash value must be a
       CODE ref.  The "closures" named argument is not allowed once evaluation has begun.

       When an action name is a key in the "closures" named argument, the usual action resolution
       mechanism of the semantics is bypassed.  One common use of the "closures" named argument
       is to allow anonymous subroutines to be semantic actions.  For more details, see the
       document on semantics.
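
       A minimal sketch of supplying anonymous subroutines as semantic actions.  The "Sum"
       grammar and the action names "first_arg" and "do_add" are hypothetical; every rule names
       its action explicitly so that each one can be found in the "closures" hash.

```perl
use strict;
use warnings;
use Marpa::R2;

# Hypothetical grammar: one or more Numbers separated by Add.
my $grammar = Marpa::R2::Grammar->new(
    {   start => 'Sum',
        rules => [
            { lhs => 'Sum', rhs => [qw/Number/],         action => 'first_arg' },
            { lhs => 'Sum', rhs => [qw/Sum Add Number/], action => 'do_add' },
        ],
    }
);
$grammar->precompute();

# Each key is an action name; each value is the CODE ref to run for it.
my $recce = Marpa::R2::Recognizer->new(
    {   grammar  => $grammar,
        closures => {
            first_arg => sub {
                my ( undef, $first ) = @_;
                return $first;
            },
            do_add => sub {
                my ( undef, $left, undef, $right ) = @_;
                return $left + $right;
            },
        },
    }
);

for my $token ( [ 'Number', 1 ], ['Add'], [ 'Number', 2 ], ['Add'], [ 'Number', 3 ] ) {
    defined $recce->read( @{$token} )
        or die "Token rejected: $token->[0]";
}

my $value_ref = $recce->value();
my $value = $value_ref ? ${$value_ref} : 'No Parse';
print "$value\n";
```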

   end
       The "end" named argument specifies the parse end location.  The default is for the parse
       to end where the input did, so that the parse returned is of the entire input.  The "end"
       named argument is not allowed once evaluation has begun.  "Location" as referred to here
       and elsewhere in this document is what is also called an Earley set ID.

   event_if_expected
       The value of the "event_if_expected" named argument must be a reference to an array of
       symbol names.  Expected-symbol events will be turned on for those symbol names.
       Expected-symbol events may be turned off (or back on) using the
       "expected_symbol_event_set()" method.  The advantage of the "event_if_expected" named
       argument is that it takes effect as soon as the recognizer is created, while events set
       using the "expected_symbol_event_set()" method cannot occur until after the first token
       is read.

   grammar
       The "new" method is required to have a "grammar" named argument.  Its value must be a
       precomputed Marpa grammar object.  The "grammar" named argument is not allowed anywhere
       else.

   max_parses
       If non-zero, causes a fatal error when that number of parse results is exceeded.
       "max_parses" is useful to limit CPU usage and output length when testing and debugging.
       Stable and production applications may prefer to count the number of parses, and take a
       less Draconian response when the count is exceeded.

       The value must be an integer.  If it is zero, there will be no limit on the number of
       parse results returned.  The default is for there to be no limit.

   ranking_method
       The value must be a string: one of "none", "rule", or "high_rule_only".  When the
       value is "none", Marpa returns the parse results in arbitrary order.  This is the
       default.  The "ranking_method" named argument is not allowed once evaluation has begun.

       The "rule" and "high_rule_only" ranking methods allow the user to control the order
       in which parse results are returned by the "value" method, and to exclude some parse
       results from the parse series.  For details, see the document on parse order.

   too_many_earley_items
       The "too_many_earley_items" argument is optional.  If specified, it sets the Earley item
       warning threshold.  If an Earley set becomes larger than the Earley item warning
       threshold, a recognizer event is generated, and a warning is printed to the trace file
       handle.

       Marpa parses from any BNF, and can handle grammars and inputs which produce large Earley
       sets.  But parsing that involves large Earley sets can be slow.  Large Earley sets are
       something most applications can, and will wish to, avoid.

       By default, Marpa calculates an Earley item warning threshold based on the size of the
       grammar.  The default threshold will never be less than 100.  If the Earley item warning
       threshold is set to 0, no recognizer event is generated, and warnings about large Earley
       sets are turned off.

   trace_actions
       The "trace_actions" named argument is a boolean.  If the boolean value is true, Marpa
       prints tracing information as it resolves action names to Perl closures.  A boolean value
       of false turns tracing off, which is the default.  Traces are written to the trace file
       handle.

   trace_file_handle
       The value is a file handle.  Traces and warning messages go to the trace file handle.  By
       default the trace file handle is inherited from the grammar used to create the recognizer.

   trace_terminals
       Very handy in debugging, and often useful even when the problem is not in the lexing.  The
       value is a trace level.  When the trace level is 0, tracing of terminals is off.  This is
       the default.

       At a trace level of 1 or higher, Marpa produces a trace message for each terminal as it is
       accepted or rejected by the recognizer.  At a trace level of 2 or higher, the trace
       messages include, for every location, a list of the terminals expected.  In practical
       grammars, output from trace level 2 can be voluminous.
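
       A sketch of capturing terminal traces in a string by combining "trace_terminals" with
       "trace_file_handle".  The "Sum" grammar is a hypothetical example, and the in-memory file
       handle is a choice made here for illustration; the exact wording of the trace messages is
       not assumed.

```perl
use strict;
use warnings;
use Marpa::R2;

# Hypothetical grammar: one or more Numbers separated by Add.
my $grammar = Marpa::R2::Grammar->new(
    {   start => 'Sum',
        rules => [
            { lhs => 'Sum', rhs => [qw/Number/] },
            { lhs => 'Sum', rhs => [qw/Sum Add Number/] },
        ],
    }
);
$grammar->precompute();

# Send traces to an in-memory file handle so they can be inspected.
my $trace_output = q{};
open my $trace_fh, '>', \$trace_output
    or die "Cannot open in-memory file handle: $!";

my $recce = Marpa::R2::Recognizer->new(
    {   grammar           => $grammar,
        trace_file_handle => $trace_fh,
        trace_terminals   => 1,
    }
);

$recce->read( 'Number', 1 );    # accepted
$recce->read( 'Number', 2 );    # rejected: Add is expected here
$recce->read('Add');            # accepted

close $trace_fh or die "Cannot close in-memory file handle: $!";
print $trace_output;
```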

   trace_values
       The "trace_values" named argument is a numeric trace level.  If the numeric trace level is
       1, Marpa prints tracing information as values are computed in the evaluation stack.  A
       trace level of 0 turns value tracing off, which is the default.  Traces are written to the
       trace file handle.

   warnings
       The value is a boolean.  Warnings are written to the trace file handle.  By default, the
       recognizer's warnings are on.  Usually, an application will want to leave them on.

Recognizer events

       The recognizer's "read()" method can generate events.  To access events, use the
       recognizer's "events()" method.

       The "EARLEY_ITEM_THRESHOLD" and the "EXHAUSTED" events are enabled by default.  Events
       optionally have an "event value", as specified in the description of each event.  The
       following events are possible.

   EARLEY_ITEM_THRESHOLD
       The Earley item threshold was exceeded.  For more about the Earley item warning threshold,
       see "too_many_earley_items".  No event value is defined for this event.  This event is
       enabled by default.

   EXHAUSTED
       "Exhaustion" means that the next "read" call must fail, because there is no token that
       will be acceptable to it.  More details on "exhaustion" are in a section below.  No event
       value is defined for this event.  This event is enabled by default.

   SYMBOL_EXPECTED
       A "symbol expected" event means that a symbol is expected at that point.  The event value
       of this event is the symbol whose expectation caused the event.  This event is disabled by
       default.  For details, see "expected_symbol_event_set()".

Parse exhaustion

       A parse is exhausted when it will accept no more input.  An exhausted parse is not
       necessarily a failed parse.  Grammars are often written so that once they "find what they
       are looking for", no further input is acceptable.  Grammars of that kind become exhausted
       when they succeed.

       By default, a recognizer event occurs whenever the parse is exhausted.  An application can
       also check for exhaustion explicitly, using the recognizer's "exhausted" method.
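
       The sketch below shows a grammar that becomes exhausted exactly when it succeeds.  The
       "Greeting" grammar and the "My_Actions" package are assumptions made for illustration.

```perl
use strict;
use warnings;
use Marpa::R2;

# Hypothetical grammar: exactly two tokens, after which nothing is acceptable.
my $grammar = Marpa::R2::Grammar->new(
    {   start   => 'Greeting',
        actions => 'My_Actions',
        rules   => [
            { lhs => 'Greeting', rhs => [qw/Hello World/], action => 'join_words' },
        ],
    }
);
$grammar->precompute();

sub My_Actions::join_words {
    my ( undef, @words ) = @_;
    return join q{ }, @words;
}

my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );

$recce->read( 'Hello', 'hello' );
my $exhausted_midway = $recce->exhausted() ? 1 : 0;

$recce->read( 'World', 'world' );
my $exhausted_at_end = $recce->exhausted() ? 1 : 0;

# The events of the last read should include the EXHAUSTED event.
my @event_names = map { $_->[0] } @{ $recce->events() };

# Exhausted does not mean failed: the parse can still be evaluated.
my $value_ref = $recce->value();
my $value = $value_ref ? ${$value_ref} : 'No Parse';

# Reading into an exhausted recognizer throws, so trap the exception.
my $read_survived = eval { $recce->read( 'Hello', 'again' ); 1 } ? 1 : 0;

print "midway=$exhausted_midway end=$exhausted_at_end value=$value\n";
print "events: @event_names\n";
print "further read survived: $read_survived\n";
```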

Ruby Slippers parsing

           $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );

           my @tokens = (
               [ 'Number', 42 ],
               ['Multiply'], [ 'Number', 1 ],
               ['Add'],      [ 'Number', 7 ],
           );

           TOKEN: for ( my $token_ix = 0; $token_ix <= $#tokens; $token_ix++ ) {
               defined $recce->read( @{ $tokens[$token_ix] } )
                   or fix_things( $recce, $token_ix, \@tokens )
                   or die q{Don't know how to fix things};
           }

       Marpa is able to tell the application which symbols are acceptable as tokens at the next
       location in the parse.  The "terminals_expected" method returns the list of tokens that
       will be accepted by the next "read".  The application can use this information to change
       the input "on the fly" so that it is acceptable to the parser.

       An application can also take a "try it and see" approach.  If an application is not sure
       whether a token is acceptable or not, the application can try to read the dubious token
       using the "read" method.  If the token is rejected, the "read" method call will return a
       Perl "undef".  At that point, the application can retry the "read" with a different token.
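
       The two approaches can be combined: try the token, and if it is rejected, consult
       "terminals_expected" to decide on a repair.  The sketch below is one hypothetical repair
       strategy, not the "fix_things" routine referred to above: it treats two adjacent numbers
       as an implicit multiplication by supplying a virtual "Multiply" token.  The calculator
       grammar and the "My_Actions" package are assumptions made for illustration.

```perl
use strict;
use warnings;
use Marpa::R2;

# Assumed grammar: a small calculator in which Multiply binds tighter than Add.
my $grammar = Marpa::R2::Grammar->new(
    {   start          => 'Expression',
        actions        => 'My_Actions',
        default_action => 'first_arg',
        rules          => [
            { lhs => 'Expression', rhs => [qw/Term/] },
            { lhs => 'Term',       rhs => [qw/Factor/] },
            { lhs => 'Factor',     rhs => [qw/Number/] },
            { lhs => 'Term', rhs => [qw/Term Add Term/], action => 'do_add' },
            {   lhs    => 'Factor',
                rhs    => [qw/Factor Multiply Factor/],
                action => 'do_multiply'
            },
        ],
    }
);
$grammar->precompute();

sub My_Actions::do_add {
    my ( undef, $left, undef, $right ) = @_;
    return $left + $right;
}

sub My_Actions::do_multiply {
    my ( undef, $left, undef, $right ) = @_;
    return $left * $right;
}

sub My_Actions::first_arg {
    my ( undef, $first ) = @_;
    return $first;
}

my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );

# The input omits the Multiply between 42 and 2.
my @tokens = ( [ 'Number', 42 ], [ 'Number', 2 ], ['Add'], [ 'Number', 7 ] );

TOKEN: for my $token (@tokens) {
    next TOKEN if defined $recce->read( @{$token} );

    # Rejected: if the parser would accept Multiply, invent one and retry.
    my %expected = map { $_ => 1 } @{ $recce->terminals_expected() };
    $expected{Multiply}
        or die "Don't know how to fix rejected token: $token->[0]";
    defined $recce->read('Multiply')
        or die 'Virtual Multiply token rejected';
    defined $recce->read( @{$token} )
        or die "Token still rejected after fix: $token->[0]";
}

my $value_ref = $recce->value();
my $value = $value_ref ? ${$value_ref} : 'No Parse';
print "$value\n";
```

       With the virtual token supplied, the input is evaluated as (42 * 2) + 7.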

   An example
       Marpa's HTML parser, Marpa::HTML, is an example of how Ruby Slippers parsing can help with
       a non-trivial, real-life application.  When a token is rejected in Marpa::HTML, it changes
       the input to match the parser's expectations by

       •   Modifying existing tokens, and

       •   Creating new tokens.

       The second technique, the creation of new "virtual" tokens, is used by Marpa::HTML to deal
       with omitted start and end tags.  The actual HTML grammar that Marpa::HTML uses takes an
       oversimplified view of the HTML -- it assumes, even when the HTML standards do not require
       it, that start and end tags are always present.  For most HTML files of interest, this
       assumption will be contrary to fact.

       Ruby Slippers parsing is used to make the grammar's over-simplistic view of the world come
       true for it.  Whenever a token is rejected, Marpa::HTML looks at the expected tokens list.
       If it sees that a start or end tag is expected, Marpa::HTML creates a token for it -- a
       completely new "virtual" token that gives the parser exactly what it expects.  Marpa::HTML
       then resumes input at the point in the original input stream where it left off.

Copyright and License

         Copyright 2014 Jeffrey Kegler
         This file is part of Marpa::R2.  Marpa::R2 is free software: you can
         redistribute it and/or modify it under the terms of the GNU Lesser
         General Public License as published by the Free Software Foundation,
         either version 3 of the License, or (at your option) any later version.

         Marpa::R2 is distributed in the hope that it will be useful,
         but WITHOUT ANY WARRANTY; without even the implied warranty of
         MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
         Lesser General Public License for more details.

         You should have received a copy of the GNU Lesser
         General Public License along with Marpa::R2.  If not, see
         http://www.gnu.org/licenses/.