Provided by: tcllib_1.17-dfsg-1_all

NAME

       string::token - Regex based iterative lexing

SYNOPSIS

       package require Tcl  8.5

       package require string::token  ?1?

       package require fileutil

       ::string token text lex string

       ::string token file lex path

       ::string token chomp lex startvar string resultvar

_________________________________________________________________________________________________

DESCRIPTION

       This  package  provides  commands  for  regular  expression based lexing (tokenization) of
       strings.

       The complete set of procedures is described below.

       ::string token text lex string
              This command takes an ordered dictionary lex mapping regular expressions to labels,
              and tokenizes the string according to this dictionary.

              The result of the command is a list of tokens, where each token is a 3-element list
              containing the label, followed by the start and end character indices of the  token
              in the string.

              The command will throw an error if it is not able to tokenize the whole string.
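
              As an illustration, the following is a minimal sketch of a call to  ::string  token
              text. The lexing dictionary, its labels and the input string are made up  for  this
              example and are not part of the package. The sketch assumes that  the  end index of
              a token is inclusive, as it is for the indices reported by regexp -indices.

```tcl
package require Tcl 8.5
package require string::token

# Hypothetical lexing dictionary: regular expression -> label.
# The patterns are tried in the order in which they are listed.
set lex {
    {[0-9]+}        number
    {[a-zA-Z_]\w*}  ident
    {[-+*/=]}       op
    {\s+}           space
}

set input  "x = 42"
set tokens [string token text $lex $input]

# Each token is a list of label, start index and end index.
foreach token $tokens {
    lassign $token label first last
    puts [format "%-6s %d..%d '%s'" $label $first $last \
              [string range $input $first $last]]
}
```

              Whitespace is given its own pattern here because every character of the input has to
              be covered by some pattern, otherwise the command throws an error.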

       ::string token file lex path
              This command is a  convenience  wrapper  around  ::string  token  text  above,  and
              fileutil::cat,  enabling  the  easy  tokenization  of  whole files.  Note that this
              command loads the file wholly into memory before starting to process it.

              If the file is too large for this mode of operation, a command  directly  based  on
              ::string token chomp below will be necessary.
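
              A minimal sketch of tokenizing a small file follows.  The  file  content  and  the
              lexing dictionary are hypothetical. The temporary file is created with the fileutil
              commands tempfile and writeFile purely to keep the example self-contained.

```tcl
package require Tcl 8.5
package require fileutil
package require string::token

# Hypothetical lexing dictionary for a comma separated list of numbers.
set lex {
    {[0-9]+}  number
    {,}       comma
    {\s+}     space
}

# Create a small sample file to run the lexer on.
set path [fileutil::tempfile]
fileutil::writeFile $path "1, 22, 333\n"

# The whole file is read into memory and then tokenized.
set tokens [string token file $lex $path]
file delete $path

puts "[llength $tokens] tokens"
puts "first token: [lindex $tokens 0]"
```

              The indices in the result refer to character  positions  in  the  file  content  as
              read into memory.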

       ::string token chomp lex startvar string resultvar
              This  command is the workhorse underlying ::string token text above. It is  exposed
              to enable users to write their own lexers, which, for example, may apply  different
              lexing dictionaries according to some internal state, etc.

              The  command takes an ordered dictionary lex mapping regular expressions to labels,
              a variable startvar which indicates where to start lexing in the input string,  and
              a result variable resultvar to extend.

              The result of the command is a tri-state numeric code indicating one of:

              0      No token found.

              1      Token found.

              2      End of string reached.

              Note  that  recognition  of  a  token from lex is started at the character index in
              startvar.

              If a token was recognized (status 1) the command will update the index in  startvar
              to  point  to  the  first character of the string past the recognized token, and it
              will further extend the resultvar  with  a  3-element  list  containing  the  label
              associated  with  the  regular  expression  of  the  token, and the start- and end-
              character-indices of the token in string.

              Neither startvar nor resultvar will be updated if no token is recognized at all.
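
              The status codes and variable updates described above can be sketched with a  small
              driver loop. The lexing dictionary, the input and the variable names are made up for
              this example. The input deliberately ends in a character that no pattern covers.

```tcl
package require Tcl 8.5
package require string::token

# Hypothetical lexing dictionary without a pattern for letters.
set lex {
    {[0-9]+}  number
    {\s+}     space
}

set input  "12 34 x"
set start  0
set tokens {}

# Pull one token per call, for as long as chomp reports status 1.
# Note that the variables are passed by name, the string by value.
while {[set status [string token chomp $lex start $input tokens]] == 1} {}

puts "status: $status"   ;# 0 = no token found, 2 = end of string
puts "start:  $start"    ;# index at which lexing stopped
puts "tokens: $tokens"
```

              When the loop ends with status 0 the index in start still points at  the  character
              which could not be tokenized, so a custom lexer can report it, skip  it,  or  switch
              to a different lexing dictionary and continue.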

              Note that the regular expressions are  applied  (tested)  in  the  order  they  are
              specified in lex, and the first matching pattern stops the process. Because of this
              it is recommended to order the patterns in lex from the most specific to  the  most
              general.

              Further  note  that  all regex patterns are implicitly prefixed with the constraint
              escape \A to ensure that a match starts exactly at the  character  index  found  in
              startvar.
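
              Both notes can be illustrated with a short sketch. The  dictionaries,  labels  and
              inputs are hypothetical. The first part compares two orderings of the same patterns,
              the second part shows that a pattern is not allowed to skip ahead in the input.

```tcl
package require Tcl 8.5
package require string::token

# Specific pattern (the keyword) first, general pattern (any word) second.
set specific {
    {if\M}    keyword
    {[a-z]+}  word
    {\s+}     space
}

# The same patterns with the general one first. It always wins over the
# keyword pattern, which is therefore never reached.
set general {
    {[a-z]+}  word
    {if\M}    keyword
    {\s+}     space
}

set bySpecific [string token text $specific "if x"]
set byGeneral  [string token text $general  "if x"]
puts $bySpecific
puts $byGeneral

# Anchoring: digits occur later in the input, but not at index 0,
# so no token is found and neither variable is modified.
set start  0
set result {}
set status [string token chomp {{[0-9]+} number} start "ab12" result]
puts "status $status, start $start, result '$result'"
```

              With the general pattern first the word "if" is labeled as a plain word, which  is
              why the more specific keyword pattern should be listed before it.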

BUGS, IDEAS, FEEDBACK

       This  document,  and  the  package  it  describes, will undoubtedly contain bugs and other
       problems.   Please  report  such  in  the  category  textutil  of  the   Tcllib   Trackers
       [http://core.tcl.tk/tcllib/reportlist].  Please also report any ideas for enhancements you
       may have for either package and/or documentation.

KEYWORDS

       lexing, regex, string, tokenization

CATEGORY

       Text processing