Ubuntu Manpage: string::token - Regex based iterative lexing

NAME

       string::token - Regex based iterative lexing

SYNOPSIS

       package require Tcl  8.5

       package require string::token  ?1?

       package require fileutil

       ::string token text lex string

       ::string token file lex path

       ::string token chomp lex startvar string resultvar

_________________________________________________________________________________________________

DESCRIPTION

This package provides commands for regular expression based lexing (tokenization) of
strings.

The complete set of procedures is described below.

::string token text lex string
This command takes an ordered dictionary lex mapping regular expressions to labels,
and tokenizes the string according to this dictionary.

The result of the command is a list of tokens, where each token is a 3-element list
holding the label and the start and end index of the token in the string.

The command will throw an error if it is not able to tokenize the whole string.
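
As a minimal sketch of the call described above, the following tokenizes a short
string. The lexing dictionary and its labels (number, ident, op, space) are
invented for illustration and are not part of the package. The sketch assumes the
end index is inclusive, so it can be passed directly to string range.

```tcl
package require Tcl 8.5
package require string::token

# Illustrative lexing dictionary: ordered regex/label pairs.
set lex {
    {[0-9]+}       number
    {[a-zA-Z_]+}   ident
    {[-+*/=]}      op
    {\s+}          space
}

set input "x = 42"
set tokens [string token text $lex $input]

# Each token is a list {label start end}; recover the text with string range.
foreach tok $tokens {
    lassign $tok label first last
    puts [format "%-6s %s" $label [string range $input $first $last]]
}
```

Whitespace gets its own pattern here because the command raises an error when
any part of the string is not covered by some pattern.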

::string token file lex path
This command is a convenience wrapper around ::string token text above, and
fileutil::cat, enabling the easy tokenization of whole files. Note that this
command loads the file wholly into memory before starting to process it.

If the file is too large for this mode of operation, a custom command built directly
on ::string token chomp below will be necessary.
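
A small sketch of the file variant follows. The sample file is created only so
that the example is self-contained; its name prefix (lexdemo), its content and
the lexing dictionary are assumptions made for illustration.

```tcl
package require Tcl 8.5
package require string::token
package require fileutil

# Illustrative lexing dictionary: numbers separated by whitespace.
set lex {
    {[0-9]+}   number
    {\s+}      space
}

# Write a small sample file to a temporary path.
set path [fileutil::tempfile lexdemo]
fileutil::writeFile $path "10 200\n"

# The whole file is read into memory and then tokenized.
set tokens [string token file $lex $path]
file delete $path
puts $tokens
```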

::string token chomp lex startvar string resultvar
This command is the workhorse underlying ::string token text above. It is exposed
to enable users to write their own lexers, which, for example, may apply different
lexing dictionaries according to some internal state.

The command takes an ordered dictionary lex mapping regular expressions to labels,
a variable startvar which indicates where to start lexing in the input string, and
a result variable resultvar to extend.

The result of the command is a tri-state numeric code indicating one of

0 No token found.

1 Token found.

2 End of string reached.

Note that recognition of a token from lex is started at the character index in
startvar.

If a token was recognized (status 1) the command will update the index in startvar
to point to the first character of the string past the recognized token, and it
will further extend the resultvar with a 3-element list containing the label
associated with the regular expression of the token, and the start and end
character indices of the token in string.

Neither startvar nor resultvar will be updated if no token is recognized at all.
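
The status codes and variable updates described above can be driven by a loop
like the following. This is only a sketch of one possible custom lexer: the
helper lex_skip_space, the label name space and the lexing dictionary are
hypothetical and not provided by the package.

```tcl
package require Tcl 8.5
package require string::token

# Hypothetical helper: tokenize $input, dropping tokens labeled "space".
proc lex_skip_space {lex input} {
    set start  0
    set tokens {}
    while {1} {
        switch -exact -- [string token chomp $lex start $input tokens] {
            0 {
                # No pattern matched; start still holds the failing index.
                error "no token matches at index $start"
            }
            1 {
                # chomp appended one {label start end} entry; drop whitespace.
                if {[lindex $tokens end 0] eq "space"} {
                    set tokens [lrange $tokens 0 end-1]
                }
            }
            2 {
                # End of string reached.
                return $tokens
            }
        }
    }
}

# Illustrative lexing dictionary.
set lex {
    {[0-9]+}       number
    {[a-zA-Z_]+}   ident
    {[-+*/=]}      op
    {\s+}          space
}
puts [lex_skip_space $lex "x = 42"]
```

The same loop shape leaves room for switching to a different lexing dictionary
between calls, since lex is passed anew on every invocation.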

Note that the regular expressions are applied (tested) in the order they are
specified in lex, and the first matching pattern stops the process. Because of this
it is recommended to list the patterns in lex from the most specific to the most
general.
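
To illustrate why the order matters, the sketch below tokenizes the same input
with two dictionaries that differ only in the order of their patterns. The
patterns and the labels keyword and ident are made up for this example.

```tcl
package require Tcl 8.5
package require string::token

# The specific keyword pattern comes before the general identifier pattern.
set specific_first {
    {if}        keyword
    {[a-z]+}    ident
}

# Same patterns, general one first: the keyword pattern is never reached.
set general_first {
    {[a-z]+}    ident
    {if}        keyword
}

set a [string token text $specific_first "if"]
set b [string token text $general_first  "if"]
puts $a   ;# {keyword 0 1}
puts $b   ;# {ident 0 1}
```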

Further note that all regex patterns are implicitly prefixed with the constraint
escape \A to ensure that a match starts exactly at the character index found in
startvar.

BUGS, IDEAS, FEEDBACK

       This  document,  and  the  package  it  describes, will undoubtedly contain bugs and other
       problems.   Please  report  such  in  the  category  textutil  of  the   Tcllib   Trackers
       [http://core.tcl.tk/tcllib/reportlist].  Please also report any ideas for enhancements you
       may have for either package and/or documentation.

KEYWORDS

       lexing, regex, string, tokenization
