Ubuntu Manpage: re2c - convert regular expressions to C/C++ code

NAME

       re2c - convert regular expressions to C/C++ code

SYNOPSIS

       re2c [OPTIONS] FILE

DESCRIPTION

       re2c  is a lexer generator for C/C++. It finds regular expression specifications inside of C/C++ comments
       and replaces them with a hard-coded DFA. The user must supply some interface code in order to control and
       customize the generated DFA.

OPTIONS

-? -h --help
Invoke a short help.

-b --bit-vectors
Implies -s. Use bit vectors as well in the attempt to coax better code out of the compiler. Most
useful for specifications with more than a few keywords (e.g. for most programming languages).

-c --conditions
Enable (f)lex-like condition support.

-d --debug-output
Creates a parser that dumps information about the current position and in which state the parser
is while parsing the input. This is useful to debug parser issues and states. If you use this
switch you need to define a macro YYDEBUG that is called like a function with two parameters: void
YYDEBUG (int state, char current). The first parameter receives the state or -1 and the second
parameter receives the input at the current cursor.

-D --emit-dot
Emit Graphviz dot data. It can then be processed with e.g. dot -Tpng input.dot > output.png.
Please note that scanners with many states may crash dot.

-e --ecb
Generate a parser that supports EBCDIC. The generated code can deal with any character up to 0xFF.
In this mode re2c assumes that input character size is 1 byte. This switch is incompatible with
-w, -x, -u and -8.

-f --storable-state
Generate a scanner with support for storable state.

-F --flex-syntax
Partial support for flex syntax. When this flag is active then named definitions must be
surrounded by curly braces and can be defined without an equal sign and the terminating
semicolon. Instead names are treated as direct double quoted strings.

-g --computed-gotos
Generate a scanner that utilizes GCC's computed goto feature. That is re2c generates jump tables
whenever a decision is of a certain complexity (e.g. a lot of if conditions are otherwise
necessary). This is only useable with GCC and produces output that cannot be compiled with any
other compiler. Note that this implies -b and that the complexity threshold can be configured
using the inplace configuration cgoto:threshold.

-i --no-debug-info
Do not output #line information. This is useful when you want to use a CMS tool with the re2c output
which you might want if you do not require your users to have re2c themselves when building from
your source.

-o OUTPUT --output=OUTPUT
Specify the OUTPUT file.

-r --reusable
Allows reuse of scanner definitions with /*!use:re2c */ after /*!rules:re2c */. In this mode no
/*!re2c */ block and exactly one /*!rules:re2c */ must be present. The rules are being saved and
used by every /*!use:re2c */ block that follows. These blocks can contain inplace configurations,
especially re2c:flags:e, re2c:flags:w, re2c:flags:x, re2c:flags:u and re2c:flags:8. That way it
is possible to create the same scanner multiple times for different character types, different
input mechanisms or different output mechanisms. The /*!use:re2c */ blocks can also contain
additional rules that will be appended to the set of rules in /*!rules:re2c */.

-s --nested-ifs
Generate nested ifs for some switches. Many compilers need this assist to generate better code.

-t HEADER --type-header=HEADER
Create a HEADER file that contains types for the (f)lex-like condition support. This can only be
activated when -c is in use.

-u --unicode
Generate a parser that supports UTF-32. The generated code can deal with any valid Unicode
character up to 0x10FFFF. In this mode re2c assumes that input character size is 4 bytes. This
switch is incompatible with -e, -w, -x and -8. This implies -s.

-v --version
Show version information.

-V --vernum
Show the version as a number XXYYZZ.

-w --wide-chars
Generate a parser that supports UCS-2. The generated code can deal with any valid Unicode
character up to 0xFFFF. In this mode re2c assumes that input character size is 2 bytes. This
switch is incompatible with -e, -x, -u and -8. This implies -s.

-x --utf-16
Generate a parser that supports UTF-16. The generated code can deal with any valid Unicode
character up to 0x10FFFF. In this mode re2c assumes that input character size is 2 bytes. This
switch is incompatible with -e, -w, -u and -8. This implies -s.

-8 --utf-8
Generate a parser that supports UTF-8. The generated code can deal with any valid Unicode
character up to 0x10FFFF. In this mode re2c assumes that input character size is 1 byte. This
switch is incompatible with -e, -w, -x and -u.

--case-insensitive
All strings are case insensitive, so all "-expressions are treated in the same way '-expressions
are.

--case-inverted
Invert the meaning of single and double quoted strings. With this switch single quotes are case
sensitive and double quotes are case insensitive.

--no-generation-date
Suppress date output in the generated file.

--no-version
Suppress version output in the generated file.

--encoding-policy POLICY
Specify how re2c must treat Unicode surrogates. POLICY can be one of the following: fail (abort
with error when surrogate encountered), substitute (silently substitute surrogate with error code
point 0xFFFD), ignore (treat surrogates as normal code points). By default re2c ignores surrogates
(for backward compatibility). Unicode standard says that standalone surrogates are invalid code
points, but different libraries and programs treat them differently.

--input INPUT
Specify re2c input API. INPUT can be one of the following: default, custom.

-S --skeleton
Instead of embedding re2c-generated code into C/C++ source, generate a self-contained program for
the same DFA. Most useful for correctness and performance testing.

--empty-class POLICY
What to do if user inputs empty character class. POLICY can be one of the following: match-empty
(match empty input: pretty illogical, but this is the default for backwards compatibility reason),
match-none (fail to match on any input), error (compilation error). Note that there are various
ways to construct empty class, e.g: [], [^\x00-\xFF], [\x00-\xFF][\x00-\xFF].

--dfa-minimization <table | moore>
Internal algorithm used by re2c to minimize DFA (defaults to moore). Both table filling and
Moore's algorithms should produce identical DFA (up to states relabelling). Table filling
algorithm is much simpler and slower; it serves as a reference implementation.

-1 --single-pass
Deprecated and does nothing (single pass is by default now).

-W
Turn on all warnings.

-Werror
Turn warnings into errors. Note that this option alone doesn't turn on any warnings, it only
affects those warnings that have been turned on so far or will be turned on later.

-W<warning>
Turn on individual warning.

-Wno-<warning>
Turn off individual warning.

-Werror-<warning>
Turn on individual warning and treat it as error (this implies -W<warning>).

-Wno-error-<warning>
Don't treat this particular warning as error. This doesn't turn off the warning itself.

-Wcondition-order
Warn if the generated program makes implicit assumptions about condition numbering. One should use
either -t, --type-header option or /*!types:re2c*/ directive to generate mapping of condition
names to numbers and use autogenerated condition names.

-Wempty-character-class
Warn if regular expression contains empty character class. From the rational point of view trying
to match empty character class makes no sense: it should always fail. However, for backwards
compatibility reasons re2c allows empty character class and treats it as empty string. Use
--empty-class option to change default behaviour.

-Wmatch-empty-string
Warn if regular expression in a rule is nullable (matches empty string). If DFA runs in a loop and
empty match is unintentional (input position is not advanced manually), lexer may get stuck in
eternal loop.

-Wswapped-range
Warn if range lower bound is greater than upper bound. Default re2c behaviour is to silently swap
range bounds.

-Wundefined-control-flow
Warn if some input strings cause undefined control flow in lexer (the faulty patterns are
reported). This is the most dangerous and common mistake. It can be easily fixed by adding default
rule * (this rule has the lowest priority, matches any code unit and consumes exactly one code
unit).

-Wuseless-escape
Warn if a symbol is escaped when it shouldn't be. By default re2c silently ignores escape, but
this may as well indicate a typo or an error in escape sequence.
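
For the -d option described earlier in this list, YYDEBUG can be a thin macro over an ordinary
function. The following is a minimal sketch, not taken from re2c itself: the names format_trace
and yydebug_trace and the trace format are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: formats one trace line into buf and returns the
 * number of characters that the full line needs (as snprintf does). */
int format_trace(char *buf, size_t size, int state, char current)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "state=%d char=0x%02X",
                    state, (unsigned int)(unsigned char)current);
}

/* Function with the signature the manual asks for:
 * void YYDEBUG (int state, char current). */
void yydebug_trace(int state, char current)
{
    char line[64];
    format_trace(line, sizeof line, state, current);
    fprintf(stderr, "%s\n", line);
}

/* The macro that code generated with -d would "call". */
#define YYDEBUG(state, current) yydebug_trace((state), (current))
```

Keeping the formatting in a separate function makes the trace easy to redirect or to check in
isolation; the macro itself stays a one-line forwarder.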

INTERFACE CODE

       The user must supply interface code either in the form of C/C++ code (macros, functions, variables, etc.)
       or in the form of INPLACE CONFIGURATIONS.  Which symbols must be defined and which are  optional  depends
       on a particular use case.

       YYCONDTYPE
              In  -c  mode  you  can use -t to generate a file that contains the enumeration used as conditions.
              Each of the values refers to a condition of a rule set.

       YYCTXMARKER
              l-value of type YYCTYPE *.  The generated code saves trailing context backtracking information  in
              YYCTXMARKER.  The  user  only  needs to define this macro if a scanner specification uses trailing
              context in one or more of its regular expressions.

       YYCTYPE
              Type used to hold an input symbol (code unit). Usually char or unsigned char for ASCII, EBCDIC and
              UTF-8, unsigned short for UTF-16 or UCS-2 and unsigned int for UTF-32.

       YYCURSOR
              l-value  of  type  YYCTYPE  * that points to the current input symbol. The generated code advances
              YYCURSOR as symbols are matched. On entry, YYCURSOR is assumed to point to the first character  of
              the current token. On exit, YYCURSOR will point to the first character of the following token.

       YYDEBUG (state, current)
              This  is  only  needed  if  the -d flag was specified. It allows one to easily debug the generated
              parser by calling a user defined function for every state. The function should have the  following
              signature:  void  YYDEBUG  (int state, char current). The first parameter receives the state or -1
              and the second parameter receives the input at the current cursor.

       YYFILL (n)
              The generated code "calls" YYFILL (n) when the buffer needs (re)filling: at  least  n  additional
              characters  should  be  provided.  YYFILL  (n)  should  adjust  YYCURSOR,  YYLIMIT,  YYMARKER  and
              YYCTXMARKER as needed. Note that for typical programming languages n will be  the  length  of  the
              longest  keyword  plus  one.  The  user  can  place  a comment of the form /*!max:re2c*/ to insert
              YYMAXFILL definition that is set to the maximum length value.

       YYGETCONDITION ()
              This define is used to get the condition prior to entering the scanner code when using -c  switch.
              The value must be initialized with a value from the enumeration YYCONDTYPE type.

       YYGETSTATE ()
              The user only needs to define this macro if the -f flag was specified. In that case, the generated
              code "calls" YYGETSTATE () at the very beginning of the scanner  in  order  to  obtain  the  saved
              state.  YYGETSTATE  ()  must return a signed integer. The value must be either -1, indicating that
              the scanner is entered for the first time, or a value previously saved by YYSETSTATE (s).  In  the
              second case, the scanner will resume operations right after where the last YYFILL (n) was called.

       YYLIMIT
              Expression of type YYCTYPE * that marks the end of the buffer (YYLIMIT[-1] is the last character in
              the buffer). The generated code repeatedly compares YYCURSOR to  YYLIMIT  to  determine  when  the
              buffer needs (re)filling.

       YYMARKER
              l-value  of  type  YYCTYPE *.  The generated code saves backtracking information in YYMARKER. Some
              easy scanners might not use this.

       YYMAXFILL
              This will be automatically defined by /*!max:re2c*/ blocks as explained above.

       YYSETCONDITION (c)
              This define is used to set the condition in transition rules. This is only being used when  -c  is
              active and transition rules are being used.

       YYSETSTATE (s)
              The user only needs to define this macro if the -f flag was specified. In that case, the generated
              code "calls" YYSETSTATE just before calling YYFILL (n). The parameter to YYSETSTATE  is  a  signed
              integer  that  uniquely identifies the specific instance of YYFILL (n) that is about to be called.
              Should the user wish to save the state of the scanner and have YYFILL (n) return  to  the  caller,
              all he has to do is store that unique identifier in a variable. Later, when the scanner is called
              again, it will call YYGETSTATE () and resume execution right where it left off. The generated code
              will contain both YYSETSTATE (s) and YYGETSTATE even if YYFILL (n) is being disabled.
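
       How these symbols fit together can be sketched in plain C. The code below is a hand-written
       stand-in for what a generated scanner for [0-9]+ might look like, not actual re2c output; the
       struct input type and the lex_digits function are assumptions made for illustration, and
       YYFILL simply reports end of input because this sketch has no refill source.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char YYCTYPE;

/* Hypothetical scanner context holding the buffer pointers. */
struct input {
    const YYCTYPE *cursor;  /* next symbol to read */
    const YYCTYPE *limit;   /* one past the last symbol in the buffer */
};

#define YYCURSOR  (in->cursor)
#define YYLIMIT   (in->limit)
/* No refill source here: make the calling function return -1. */
#define YYFILL(n) return -1

/* Matches a run of decimal digits starting at YYCURSOR.
 * Returns the token length, 0 if the next symbol is not a digit,
 * or -1 if the buffer is already empty on entry. */
int lex_digits(struct input *in)
{
    const YYCTYPE *start;
    YYCTYPE yych;

    assert(in != NULL);
    start = YYCURSOR;
    if ((YYLIMIT - YYCURSOR) < 1) YYFILL(1);
    while (YYCURSOR != YYLIMIT) {
        yych = *YYCURSOR;
        if (yych < '0' || yych > '9') break;
        ++YYCURSOR;  /* advance as symbols are matched */
    }
    return (int)(YYCURSOR - start);
}
```

       On return, the cursor points to the first symbol of the following token, which matches the
       contract stated for YYCURSOR above.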

SYNTAX

       Code for re2c consists of a set of RULES, NAMED DEFINITIONS and INPLACE CONFIGURATIONS.

   RULES
       Rules  consist of a regular expression (see REGULAR EXPRESSIONS) along with a block of C/C++ code that is
       to be executed when the associated regular expression is matched. You can either start the code  with  an
       opening curly brace or the sequence :=. When the code starts with a curly brace, re2c counts the brace depth
       and stops looking for code automatically. Otherwise curly braces are not allowed and re2c  stops  looking
       for  code  at the first line that does not begin with whitespace. If two or more rules overlap, the first
       rule is preferred.
          regular-expression { C/C++ code }

          regular-expression := C/C++ code

       There is one special rule: default rule *
          * { C/C++ code }

          * := C/C++ code

       Note that default rule * differs from [^]: default rule has the lowest priority, matches  any  code  unit
       (either  valid or invalid) and always consumes one character; while [^] matches any valid code point (not
       code unit) and can consume multiple code units. In fact, when variable-length encoding is used, * is  the
       only possible way to match invalid input character (see ENCODINGS for details).

       If  -c  is  active then each regular expression is preceded by a list of comma separated condition names.
       Besides normal naming rules there are two special cases: <*> (such rules are merged  to  all  conditions)
       and  <>  (such a rule cannot have an associated regular expression, its code is merged to all actions).
       Non-empty rules may furthermore specify the new condition. In that case re2c will generate the necessary
       code  to  change  the  condition automatically. Rules can use :=> as a shortcut to automatically generate
       code that not only sets the new condition state but also  continues  execution  with  the  new  state.  A
       shortcut rule should not be used in a loop where there is code between the start of the loop and the re2c
       block unless re2c:cond:goto is changed to continue. If code is necessary before  all  rules  (though  not
       simple jumps) you can do so by using <!> pseudo-rules.
          <condition-list> regular-expression { C/C++ code }

          <condition-list> regular-expression := C/C++ code

          <condition-list> * { C/C++ code }

          <condition-list> * := C/C++ code

          <condition-list> regular-expression => condition { C/C++ code }

          <condition-list> regular-expression => condition := C/C++ code

          <condition-list> * => condition { C/C++ code }

          <condition-list> * => condition := C/C++ code

          <condition-list> regular-expression :=> condition

          <*> regular-expression { C/C++ code }

          <*> regular-expression := C/C++ code

          <*> * { C/C++ code }

          <*> * := C/C++ code

          <*> regular-expression => condition { C/C++ code }

          <*> regular-expression => condition := C/C++ code

          <*> * => condition { C/C++ code }

          <*> * => condition := C/C++ code

          <*> regular-expression :=> condition

          <> { C/C++ code }

          <> := C/C++ code

          <> => condition { C/C++ code }

          <> => condition := C/C++ code

          <> :=> condition

          <! condition-list> { C/C++ code }

          <! condition-list> := C/C++ code

          <!> { C/C++ code }

          <!> := C/C++ code

   NAMED DEFINITIONS
       Named definitions are of the form:
          name = regular-expression;

       If -F is active, then named definitions are also of the form:
          name { regular-expression }

   INPLACE CONFIGURATIONS
       re2c:condprefix = yyc;
              Allows  one to specify the prefix used for condition labels. That is this text is prepended to any
              condition label in the generated output file.

       re2c:condenumprefix = yyc;
              Allows one to specify the prefix used for condition values. That is this text is prepended to  any
              condition enum value in the generated output file.

       re2c:cond:divider = /* *********************************** */ ;
              Allows  one  to  customize the divider for condition blocks. You can use @@ to put the name of the
              condition or customize the placeholder using re2c:cond:divider@cond.

       re2c:cond:divider@cond = @@;
              Specifies the placeholder that will be replaced with the condition name in re2c:cond:divider.

       re2c:cond:goto = goto @@; ;
              Allows one to customize the condition goto statements used with :=> style rules. You can use @@ to
              put  the name of the condition or customize the placeholder using re2c:cond:goto@cond. You can also
              change this to continue;, which would allow you to continue with the next loop cycle including any
              code between loop start and re2c block.

       re2c:cond:goto@cond = @@;
              Specifies the placeholder that will be replaced with the condition label in re2c:cond:goto.

       re2c:indent:top = 0;
              Specifies the minimum amount of indentation to use. Requires a numeric value greater than or equal
              to zero.

       re2c:indent:string = \t ;
              Specifies the string to use for indentation. Requires a string that should contain only whitespace
              unless  you  need this for external tools. The easiest way to specify spaces is to enclose them in
              single or double quotes.  If you do not want any indentation at all you can simply set this to "".

       re2c:yych:conversion = 0;
              When this setting is non zero, then re2c automatically generates  conversion  code  whenever  yych
              gets read. In this case the type must be defined using re2c:define:YYCTYPE.

       re2c:yych:emit = 1;
              Generation of yych can be suppressed by setting this to 0.

       re2c:yybm:hex = 0;
              If set to zero then a decimal table is being used else a hexadecimal table will be generated.

       re2c:yyfill:enable = 1;
              Set this to zero to suppress generation of YYFILL (n). When using this be sure to verify that the
              generated scanner does not read past the end of the input. Allowing this behavior might introduce
              severe security issues into your programs.

       re2c:yyfill:check = 1;
              This can be set 0 to suppress output of the pre condition using YYCURSOR and YYLIMIT which becomes
              useful when YYLIMIT + YYMAXFILL is always accessible.

       re2c:define:YYFILL = YYFILL ;
              Substitution for YYFILL. Note that by default re2c generates  argument  in  braces  and  semicolon
              after  YYFILL.  If  you  need  to  make  YYFILL  an  arbitrary  statement  rather than a call, set
              re2c:define:YYFILL:naked to non-zero and use re2c:define:YYFILL@len  to  denote  formal  parameter
              inside of YYFILL body.

       re2c:define:YYFILL@len = @@ ;
              Any occurrence of this text inside of YYFILL will be replaced with the actual argument.

       re2c:yyfill:parameter = 1;
              Controls  argument  in braces after YYFILL. If zero, argument is omitted. If non-zero, argument is
              generated unless re2c:define:YYFILL:naked is set to non-zero.

       re2c:define:YYFILL:naked = 0;
              Controls argument in braces and semicolon after YYFILL. If non-zero, both argument and semicolon
              are omitted. If zero, argument is generated unless re2c:yyfill:parameter is set to zero and
              semicolon is generated unconditionally.

       re2c:startlabel = 0;
              If set to a non zero integer then the start label of the next scanner  blocks  will  be  generated
              even  if  not  used by the scanner itself. Otherwise the normal yy0 like start label is only being
              generated if needed. If set to a text value  then  a  label  with  that  text  will  be  generated
              regardless  of whether the normal start label is being used or not. This setting is being reset to
              0 after a start label has been generated.

       re2c:labelprefix = yy ;
              Allows one to change the prefix of numbered labels. The default is yy and can be set to any string
              that is a valid label.

       re2c:state:abort = 0;
              When  not  zero and switch -f is active then the YYGETSTATE block will contain a default case that
              aborts and a -1 case is used for initialization.

       re2c:state:nextlabel = 0;
              Used when -f is active to control whether the YYGETSTATE block is  followed  by  a  yyNext:  label
              line.   Instead  of  using  yyNext  you  can  usually also use configuration startlabel to force a
              specific start label or default to yy0 as start label. Instead of using a dedicated  label  it  is
              often  better  to  separate  the  YYGETSTATE  code  from  the  actual  scanner  code  by placing a
              /*!getstate:re2c*/ comment.

       re2c:cgoto:threshold = 9;
              When -g is active this value specifies the complexity threshold that triggers generation  of  jump
              tables  rather  than using nested if's and decision bitfields. The threshold is compared against a
              calculated estimation of if-s needed where every used bitmap divides the threshold by 2.

       re2c:yych:conversion = 0;
              When the input uses signed characters and -s or -b switches are  in  effect  re2c  allows  one  to
              automatically  convert  to  the  unsigned  character  type that is then necessary for its internal
              single character. When this setting is zero or an empty string the conversion is disabled. Using a
              non zero number the conversion is taken from YYCTYPE. If that is given by an inplace configuration
              that value is being used. Otherwise it will be (YYCTYPE) and changes to that configuration are  no
              longer  possible.  When  this  setting is a string the braces must be specified. Now assuming your
              input is a char * buffer and you are using  above  mentioned  switches  you  can  set  YYCTYPE  to
              unsigned char and this setting to either 1 or (unsigned char).

       re2c:define:YYCONDTYPE = YYCONDTYPE ;
              Enumeration used for condition support with -c mode.

       re2c:define:YYCTXMARKER = YYCTXMARKER ;
              Allows one to replace the define YYCTXMARKER, and thus avoid defining it, by setting the value to
              the actual code needed.

       re2c:define:YYCTYPE = YYCTYPE ;
              Allows one to replace the define YYCTYPE, and thus avoid defining it, by setting the value to the
              actual code needed.

       re2c:define:YYCURSOR = YYCURSOR ;
              Allows one to replace the define YYCURSOR, and thus avoid defining it, by setting the value to the
              actual code needed.

       re2c:define:YYDEBUG = YYDEBUG ;
              Allows one to replace the define YYDEBUG, and thus avoid defining it, by setting the value to the
              actual code needed.

       re2c:define:YYGETCONDITION = YYGETCONDITION ;
              Substitution  for YYGETCONDITION. Note that by default re2c generates braces after YYGETCONDITION.
              Set re2c:define:YYGETCONDITION:naked to non-zero to omit braces.

       re2c:define:YYGETCONDITION:naked = 0;
              Controls braces after YYGETCONDITION. If non-zero, braces are omitted. If zero, braces are
              generated.

       re2c:define:YYSETCONDITION = YYSETCONDITION ;
              Substitution  for  YYSETCONDITION.  Note  that  by  default  re2c generates argument in braces and
              semicolon after YYSETCONDITION. If you need to make YYSETCONDITION an arbitrary  statement  rather
              than     a     call,     set     re2c:define:YYSETCONDITION:naked     to    non-zero    and    use
              re2c:define:YYSETCONDITION@cond to denote formal parameter inside of YYSETCONDITION body.

       re2c:define:YYSETCONDITION@cond = @@ ;
              Any occurrence of this text inside of YYSETCONDITION will be replaced with the actual argument.

       re2c:define:YYSETCONDITION:naked = 0;
              Controls argument in braces and semicolon after YYSETCONDITION. If non-zero, both argument and
              semicolon are omitted. If zero, both argument and semicolon are generated.

       re2c:define:YYGETSTATE = YYGETSTATE ;
              Substitution  for  YYGETSTATE.  Note  that  by default re2c generates braces after YYGETSTATE. Set
              re2c:define:YYGETSTATE:naked to non-zero to omit braces.

       re2c:define:YYGETSTATE:naked = 0;
              Controls braces after YYGETSTATE. If non-zero, braces are omitted. If zero, braces are generated.

       re2c:define:YYSETSTATE = YYSETSTATE ;
              Substitution for YYSETSTATE. Note that by default re2c generates argument in braces and  semicolon
              after  YYSETSTATE.  If  you need to make YYSETSTATE an arbitrary statement rather than a call, set
              re2c:define:YYSETSTATE:naked to non-zero and  use  re2c:define:YYSETSTATE@state to  denote  formal
              parameter inside of YYSETSTATE body.

       re2c:define:YYSETSTATE@state = @@ ;
              Any occurrence of this text inside of YYSETSTATE will be replaced with the actual argument.

       re2c:define:YYSETSTATE:naked = 0;
              Controls argument in braces and semicolon after YYSETSTATE. If non-zero, both argument and
              semicolon are omitted. If zero, both argument and semicolon are generated.

       re2c:define:YYLIMIT = YYLIMIT ;
              Allows one to replace the define YYLIMIT, and thus avoid defining it, by setting the value to the
              actual code needed.

       re2c:define:YYMARKER = YYMARKER ;
              Allows one to replace the define YYMARKER, and thus avoid defining it, by setting the value to the
              actual code needed.

       re2c:label:yyFillLabel = yyFillLabel ;
              Allows one to overwrite the name of the label yyFillLabel.

       re2c:label:yyNext = yyNext ;
              Allows one to overwrite the name of the label yyNext.

       re2c:variable:yyaccept = yyaccept;
              Allows one to overwrite the name of the variable yyaccept.

       re2c:variable:yybm = yybm ;
              Allows one to overwrite the name of the variable yybm.

       re2c:variable:yych = yych ;
              Allows one to overwrite the name of the variable yych.

       re2c:variable:yyctable = yyctable ;
              When both -c and -g are active then re2c uses this variable to generate a static  jump  table  for
              YYGETCONDITION.

       re2c:variable:yystable = yystable ;
              Deprecated.

       re2c:variable:yytarget = yytarget ;
              Allows one to overwrite the name of the variable yytarget.
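
       The conversion described under re2c:yych:conversion matters because plain char may be signed,
       so a byte such as 0xE9 read from a char * buffer can become a negative table index. The sketch
       below shows the cast in isolation; the is_high table and both function names are hypothetical
       and only stand in for a generated lookup table.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char YYCTYPE;

/* Hypothetical 256-entry table standing in for a generated bitmap:
 * nonzero for bytes in the range 0x80..0xFF. */
static unsigned char is_high[256];

void init_table(void)
{
    int i;
    for (i = 0x80; i < 256; ++i)
        is_high[i] = 1;
}

/* Reads one symbol from a char buffer and looks it up in the table. */
int is_high_byte(const char *p)
{
    YYCTYPE yych;

    assert(p != NULL);
    /* The cast to YYCTYPE is what the conversion setting would add;
     * without it a negative char would index before the table. */
    yych = (YYCTYPE)*p;
    return is_high[yych];
}
```

       With YYCTYPE set to unsigned char, every value of yych is a valid index into a 256-entry
       table regardless of whether the platform's char is signed.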

   REGULAR EXPRESSIONS
       "foo"  literal string "foo". ANSI-C escape sequences can be used.

       'foo'  literal  string  "foo" (characters [a-zA-Z] treated case-insensitive). ANSI-C escape sequences can
              be used.

       [xyz]  character class; in this case, regular expression matches either x, y, or z.

       [abj-oZ]
              character class with a range in it; matches a, b, any letter from j through o or Z.

       [^class]
              inverted character class.

       r \ s  match any r which isn't s. r and s must be regular expressions which can be expressed as character
              classes.

       r*     zero or more occurrences of r.

       r+     one or more occurrences of r.

       r?     optional r.

       (r)    r; parentheses are used to override precedence.

       r s    r followed by s (concatenation).

       r | s  either r or s (alternative).

       r / s  r  but  only  if  it  is  followed by s. Note that s is not part of the matched text. This type of
              regular expression is called "trailing context". Trailing context can only be the end  of  a  rule
              and not part of a named definition.

       r{n}   matches r exactly n times.

       r{n,}  matches r at least n times.

       r{n,m} matches r at least n times, but not more than m times.

       .      match any character except newline.

       name   matches named definition as specified by name only if -F is off. If -F is active then this behaves
              like it was enclosed in double quotes and matches the string "name".

       Character classes and string literals may contain octal or  hexadecimal  character  definitions  and  the
       following  set  of  escape  sequences: \a, \b, \f, \n, \r, \t, \v, \\. An octal character is defined by a
       backslash followed by its three octal digits (e.g. \377).  Hexadecimal characters  from  0  to  0xFF  are
       defined by backslash, a lower cased x and two hexadecimal digits (e.g. \x12). Hexadecimal characters from
       0x100 to 0xFFFF are defined by backslash, a lower cased \u or an upper  cased  \X  and  four  hexadecimal
       digits  (e.g.  \u1234).   Hexadecimal  characters from 0x10000 to 0xFFFFffff are defined by backslash, an
       upper cased \U and eight hexadecimal digits (e.g. \U12345678).

       The only portable "any" rule is the default rule *.

SCANNER WITH STORABLE STATES

       When the -f flag is specified, re2c generates a scanner that can store its current state, return  to  the
       caller, and later resume operations exactly where it left off.

       The default operation of re2c is a "pull" model, where the scanner asks for extra input whenever it needs
       it. However, this mode of operation assumes that the scanner is the "owner" of the parsing loop, and that
       may not always be convenient.

       Typically,  if  there  is a preprocessor ahead of the scanner in the stream, or for that matter any other
       procedural source of data, the scanner cannot "ask" for more data unless both scanner and source live  in
       separate threads.

       The -f flag is useful for just this situation: it lets users design scanners that work in a "push" model,
       i.e. where data is fed to the scanner chunk by chunk. When the scanner runs out of data to consume, it
       just stores its state and returns to the caller. When more input data is fed to the scanner, it resumes
       operations exactly where it left off.

       Changes needed compared to the "pull" model:

       • The user has to supply macros YYSETSTATE (state) and YYGETSTATE ().

       • The -f option inhibits declaration of yych and yyaccept, so the user has to declare them and also
         save and restore them. In the example examples/push_model/push.re they are declared as fields of
         the (C++) class of which the scanner is a method, so they do not need to be saved/restored
         explicitly. For C they could e.g. be made macros that select fields from a structure passed in as a
         parameter. Alternatively, they could be declared as local variables, saved by YYFILL (n) when it
         decides to return, and restored at entry to the function. It could also be more efficient to save
         the state from YYFILL (n), because YYSETSTATE (state) is called unconditionally. YYFILL (n), however,
         does not get state as a parameter, so YYSETSTATE (state) would have to store state in a local variable.

       • Modify YYFILL (n) to return (from the function calling it) if more input is needed.

       • Modify caller to recognise if more input is needed and respond appropriately.

       • The generated code will contain a switch block that is used to restore the last state by jumping
         behind the corresponding YYFILL (n) call. This code is automatically generated in the epilog of the first
         /*!re2c  */ block. It is possible to trigger generation of the YYGETSTATE () block earlier by placing a
         /*!getstate:re2c*/ comment. This is especially useful when the scanner code should be wrapped inside a
         loop.
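
       The changes listed above can be illustrated with a minimal hand-written sketch in C. It is not re2c
       output: the Scanner structure, the push () helper, the resume labels and the word-counting logic are
       all assumptions made for illustration. It only shows how the user-supplied state macros, a YYFILL (n)
       that returns to the caller, and the initial switch on the saved state might fit together.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scanner context; every name below is illustrative. */
typedef struct {
    int state;          /* last value passed to YYSETSTATE; -1 = fresh start */
    const char *cursor; /* next unread character of the current chunk */
    const char *limit;  /* one past the end of the current chunk */
    char yych;          /* -f inhibits this declaration, so it is kept here */
    int words;          /* number of complete words seen so far */
} Scanner;

#define YYGETSTATE()  (s->state)
#define YYSETSTATE(n) (s->state = (n))
/* Return to the caller of scan() when fewer than n characters remain. */
#define YYFILL(n) do { if (s->limit - s->cursor < (n)) return; } while (0)

int is_letter(char c) { return c >= 'a' && c <= 'z'; }

/* Hand-written stand-in for generated code: counts runs of [a-z]. */
void scan(Scanner *s)
{
    /* Dispatch on the saved state and jump behind the matching YYFILL. */
    switch (YYGETSTATE()) {
    case 0: goto resume0;
    case 1: goto resume1;
    default: break;                 /* -1: nothing to resume */
    }
    for (;;) {
        YYSETSTATE(0);              /* between words */
        YYFILL(1);
resume0:
        s->yych = *s->cursor++;
        if (!is_letter(s->yych))
            continue;
        for (;;) {
            YYSETSTATE(1);          /* inside a word */
            YYFILL(1);
resume1:
            s->yych = *s->cursor++;
            if (!is_letter(s->yych))
                break;
        }
        s->words++;
    }
}

/* Push one non-empty chunk into the scanner; returns once it is consumed. */
void push(Scanner *s, const char *chunk, size_t len)
{
    assert(len > 0);   /* resuming behind YYFILL assumes data is available */
    s->cursor = chunk;
    s->limit = chunk + len;
    scan(s);
}
```

       In this sketch a word is only counted once the character following it has been seen, so a caller
       would have to push a final delimiter (for example a newline) at the end of input.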

       Please see examples/push_model/push.re for a "push" model scanner. The generated code can be tweaked using
       inplace configurations state:abort and state:nextlabel.

SCANNER WITH CONDITION SUPPORT

You can precede regular expressions with a list of condition names when using the -c switch. In this case
re2c generates a scanner block for each condition, and each of the generated blocks has its own
precondition. The precondition is given by the interface define YYGETCONDITION() and must be of type
YYCONDTYPE.

There are two special rule types. First, the rules of the condition <*> are merged into all conditions
(note that they have lower priority than the other rules of that condition). Second, the empty condition
list allows one to provide a code block that does not have a scanner part, meaning that it does not allow
any regular expression. The condition value referring to this special block is always the one with the
enumeration value 0. This way the code of this special rule can be used to initialize a scanner. These
rules are in no way necessary, but sometimes it is helpful to have a dedicated uninitialized
condition state.

Non-empty rules allow one to specify the new condition, which makes them transition rules. Besides
generating calls to the define YYSETCONDITION, no other special code is generated.
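
The interplay between the condition precondition and transition rules can be sketched by hand in C. This is
not re2c output: the Lexer structure, the condition names normal and comment, the enumerator spellings and
the "#" comment syntax are assumptions chosen for illustration. The sketch only shows one block per
condition, selected through the condition getter, with a transition performed through the condition setter.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative condition type; the enumerator names are assumptions. */
enum YYCONDTYPE { yycnormal, yyccomment };

/* Hypothetical lexer context. */
typedef struct {
    enum YYCONDTYPE cond;   /* current condition */
    const char *cursor;     /* next unread character (NUL-terminated input) */
    int code_chars;         /* characters seen outside of comments */
} Lexer;

#define YYGETCONDITION()  (lx->cond)
#define YYSETCONDITION(c) (lx->cond = (c))

/* Hand-written stand-in for generated code: one block per condition. */
void lex(Lexer *lx)
{
    assert(lx != NULL && lx->cursor != NULL);
    for (;;) {
        char c = *lx->cursor;
        if (c == '\0')
            return;
        lx->cursor++;
        switch (YYGETCONDITION()) {
        case yycnormal:
            if (c == '#')
                YYSETCONDITION(yyccomment);  /* transition: normal to comment */
            else
                lx->code_chars++;
            break;
        case yyccomment:
            if (c == '\n')
                YYSETCONDITION(yycnormal);   /* transition: comment to normal */
            break;
        }
    }
}
```

Keeping the condition in the context structure means it survives between calls, so a comment left open
at the end of one input is still open when the next input is scanned with the same Lexer.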

There is another kind of special rule that allows one to prepend code to any code block of all rules of a
certain set of conditions, or to all code blocks of all rules. This can be helpful when some operation is
common among rules. For instance, this can be used to store the length of the scanned string. These
special setup rules start with an exclamation mark followed by either a list of conditions <! condition,
... > or a star <!*>. When re2c generates the code for a rule whose state does not have a setup rule and
a starred setup rule is present, then that code will be used as setup code.

ENCODINGS

re2c supports the following encodings: ASCII (default), EBCDIC (-e), UCS-2 (-w), UTF-16 (-x), UTF-32 (-u)
and UTF-8 (-8). See also inplace configuration re2c:flags.

The following concepts should be clarified when talking about encodings. A code point is an abstract
number that represents a single encoding symbol. A code unit is the smallest unit of memory used
in the encoded text (it corresponds to one character in the input stream). One or more code units can be
needed to represent a single code point, depending on the encoding. In a fixed-length encoding, each code
point is represented with an equal number of code units. In a variable-length encoding, different code
points can be represented with different numbers of code units.

ASCII is a fixed-length encoding. Its code space includes 0x100 code points, from 0 to 0xFF. One code
point is represented with exactly one 1-byte code unit, which has the same value as the code
point. Size of YYCTYPE must be 1 byte.

EBCDIC is a fixed-length encoding. Its code space includes 0x100 code points, from 0 to 0xFF. One code
point is represented with exactly one 1-byte code unit, which has the same value as the code
point. Size of YYCTYPE must be 1 byte.

UCS-2 is a fixed-length encoding. Its code space includes 0x10000 code points, from 0 to 0xFFFF. One
code point is represented with exactly one 2-byte code unit, which has the same value as the code
point. Size of YYCTYPE must be 2 bytes.

UTF-16 is a variable-length encoding. Its code space includes all Unicode code points, from 0 to 0xD7FF
and from 0xE000 to 0x10FFFF. One code point is represented with one or two 2-byte code units. Size
of YYCTYPE must be 2 bytes.

UTF-32 is a fixed-length encoding. Its code space includes all Unicode code points, from 0 to 0xD7FF and
from 0xE000 to 0x10FFFF. One code point is represented with exactly one 4-byte code unit. Size of
YYCTYPE must be 4 bytes.

UTF-8 is a variable-length encoding. Its code space includes all Unicode code points, from 0 to 0xD7FF
and from 0xE000 to 0x10FFFF. One code point is represented with a sequence of one, two, three or
four 1-byte code units. Size of YYCTYPE must be 1 byte.
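
The difference between fixed-length and variable-length encodings can be made concrete with a small C
sketch that computes how many code units a code point needs. The function names are hypothetical and not
part of re2c; the thresholds follow the standard UTF-8 and UTF-16 encoding forms, and the validity check
uses the code point ranges listed above.

```c
#include <assert.h>
#include <stdint.h>

/* Valid code points: 0 to 0xD7FF and 0xE000 to 0x10FFFF. */
int is_valid_code_point(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Number of 1-byte code units UTF-8 needs for cp (variable-length). */
int utf8_units(uint32_t cp)
{
    assert(is_valid_code_point(cp));
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Number of 2-byte code units UTF-16 needs for cp (variable-length). */
int utf16_units(uint32_t cp)
{
    assert(is_valid_code_point(cp));
    return cp < 0x10000 ? 1 : 2;
}

/* UTF-32 is fixed-length: always one 4-byte code unit. */
int utf32_units(uint32_t cp)
{
    assert(is_valid_code_point(cp));
    return 1;
}
```

For example, the code point 0x20AC takes three code units in UTF-8 but only one in UTF-16 and UTF-32.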

In Unicode, values in the range 0xD800 to 0xDFFF (surrogates) are not valid Unicode code points; any
encoded sequence of code units that would map to Unicode code points in the range 0xD800-0xDFFF is
ill-formed. The user can control how re2c treats such ill-formed sequences with the --encoding-policy
<policy> flag (see OPTIONS for full explanation).

For some encodings, there are code units that never occur in a valid encoded stream (e.g. the 0xFF byte in
UTF-8). If the generated scanner must check for invalid input, the only reliable way to do so is to use the
default rule *. Note that the full range rule [^] won't catch invalid code units when a variable-length
encoding is used ([^] means "all valid code points", while the default rule * means "all possible code
units").
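
To make "code units that never occur" concrete, the following C sketch lists the byte values that cannot
appear anywhere in well-formed UTF-8 (0xC0, 0xC1 and 0xF5 to 0xFF) and finds the first one in a buffer. The
function names are hypothetical and not part of re2c. This is only an illustration of such code units: it
does not detect every ill-formed sequence, and it is not a substitute for the default rule *.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 for byte values that never occur in well-formed UTF-8. */
int utf8_unit_impossible(unsigned char b)
{
    return b == 0xC0 || b == 0xC1 || b >= 0xF5;
}

/* Returns the index of the first such byte, or len if there is none. */
size_t first_impossible_unit(const unsigned char *buf, size_t len)
{
    size_t i;
    assert(buf != NULL);
    for (i = 0; i < len; i++) {
        if (utf8_unit_impossible(buf[i]))
            return i;
    }
    return len;
}
```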

GENERIC INPUT API

       re2c usually operates on input using pointer-like primitives YYCURSOR, YYMARKER, YYCTXMARKER and YYLIMIT.

       The generic input API (enabled with the --input custom switch) allows one to customize input
       operations. In this mode, re2c will express all operations on input in terms of the following primitives:

                                ┌────────────────┬───────────────────────────────────────┐
                                │YYPEEK ()       │ get current input character           │
                                ├────────────────┼───────────────────────────────────────┤
                                │YYSKIP ()       │ advance to the next character         │
                                ├────────────────┼───────────────────────────────────────┤
                                │YYBACKUP ()     │ backup current input position         │
                                ├────────────────┼───────────────────────────────────────┤
                                │YYBACKUPCTX ()  │ backup  current  input  position  for │
                                │                │ trailing context                      │
                                ├────────────────┼───────────────────────────────────────┤
                                │YYRESTORE ()    │ restore current input position        │
                                ├────────────────┼───────────────────────────────────────┤
                                │YYRESTORECTX () │ restore current  input  position  for │
                                │                │ trailing context                      │
                                ├────────────────┼───────────────────────────────────────┤
                                │YYLESSTHAN (n)  │ check if less than n input characters │
                                │                │ are left                              │
                                └────────────────┴───────────────────────────────────────┘
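
       As a minimal sketch, the primitives in the table could be defined over an in-memory buffer as shown
       below. The Input structure and the match_number () function are assumptions made for illustration;
       the function is a hand-written stand-in for generated code, not re2c output. It matches digits with an
       optional fractional part and uses the backup and restore primitives to fall back when the fraction
       is incomplete.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical input context over an in-memory buffer. */
typedef struct {
    const char *cursor;     /* current input position */
    const char *marker;     /* position saved by YYBACKUP */
    const char *ctxmarker;  /* position saved by YYBACKUPCTX */
    const char *limit;      /* one past the last input character */
} Input;

#define YYPEEK()       (*in->cursor)
#define YYSKIP()       (++in->cursor)
#define YYBACKUP()     (in->marker = in->cursor)
#define YYBACKUPCTX()  (in->ctxmarker = in->cursor)
#define YYRESTORE()    (in->cursor = in->marker)
#define YYRESTORECTX() (in->cursor = in->ctxmarker)
#define YYLESSTHAN(n)  (in->limit - in->cursor < (n))

int is_digit(char c) { return c >= '0' && c <= '9'; }

/* Matches [0-9]+ ("." [0-9]+)? at the cursor; returns the match length. */
size_t match_number(Input *in)
{
    const char *start;
    assert(in != NULL && in->cursor != NULL && in->cursor <= in->limit);
    start = in->cursor;
    while (!YYLESSTHAN(1) && is_digit(YYPEEK()))
        YYSKIP();
    if (in->cursor == start)
        return 0;                     /* no digits at all */
    YYBACKUP();                       /* remember the end of the integer part */
    if (!YYLESSTHAN(2) && YYPEEK() == '.') {
        YYSKIP();
        if (is_digit(YYPEEK())) {     /* safe: at least one character is left */
            while (!YYLESSTHAN(1) && is_digit(YYPEEK()))
                YYSKIP();
            return (size_t)(in->cursor - start);
        }
    }
    YYRESTORE();                      /* no fraction: back to the saved position */
    return (size_t)(in->cursor - start);
}
```

       Because all input access goes through the macros, the same matching code could be pointed at a
       different input source by redefining the primitives.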

       A couple of useful links that provide some examples:

       1. http://skvadrik.github.io/aleph_null/posts/re2c/2015-01-13-input_model.html

       2. http://skvadrik.github.io/aleph_null/posts/re2c/2015-01-15-input_model_custom.html

AUTHORS

       Peter Bumbulis   peter@csg.uwaterloo.ca

       Brian Young      bayoung@acm.org

       Dan Nuffer       nuffer@users.sourceforge.net

       Marcus Boerger   helly@users.sourceforge.net

       Hartmut Kaiser   hkaiser@users.sourceforge.net

       Emmanuel Mogenet mgix@mgix.com

       Ulya Trofimovich skvadrik@gmail.com

VERSION INFORMATION

       This manpage describes re2c version 0.16, package date 21 Jan 2016.

                                                                                                         RE2C(1)