Provided by: tcllib_1.17-dfsg-1_all

NAME

       htmlparse - Procedures to parse HTML strings

SYNOPSIS

       package require Tcl  8.2

       package require struct::stack  1.3

       package require cmdline  1.1

       package require htmlparse  ?1.2.1?

       ::htmlparse::parse ?-cmd cmd? ?-vroot tag? ?-split n? ?-incvar var? ?-queue q? html

       ::htmlparse::debugCallback ?clientdata? tag slash param textBehindTheTag

       ::htmlparse::mapEscapes html

       ::htmlparse::2tree html tree

       ::htmlparse::removeVisualFluff tree

       ::htmlparse::removeFormDefs tree

_________________________________________________________________________________________________

DESCRIPTION

       The  htmlparse  package  provides  commands that allow libraries and applications to parse
       HTML in a string into a representation of their choice.

       The following commands are available:

       ::htmlparse::parse ?-cmd cmd? ?-vroot tag? ?-split n? ?-incvar var? ?-queue q? html
              This command is the basic parser for HTML. It takes an HTML string, parses  it  and
              invokes  a  command  prefix  for every tag encountered. It is not necessary for the
              HTML to be valid for this parser to function.  It  is  the  responsibility  of  the
              command  invoked for every tag to check this. Another responsibility of the invoked
              command  is  the  handling  of  tag  attributes  and  character  entities  (escaped
              characters).  The  parser provides the un-interpreted tag attributes to the invoked
              command to aid in the former, and the package at large provides a  helper  command,
              ::htmlparse::mapEscapes, to aid in the handling of the latter.  The  parser  ignores
              leading DOCTYPE declarations and all valid HTML comments it encounters.

              All information beyond the HTML string itself is specified via  options,  which  are
              explained below.

              Some background information about the parser helps in understanding the options.

              It  is  capable  of detecting incomplete tags in the HTML string given to it. Under
              normal circumstances this will cause the parser to  throw  an  error,  but  if  the
              option -incvar is used to specify a global (or namespace) variable, the parser will
              store the incomplete part of the input into this variable instead.  This  will  aid
              greatly  in  the handling of incrementally arriving HTML, as the parser will handle
              whatever it can and defer the handling of the incomplete part until more  data  has
              arrived.
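
              The detection of an incomplete tag can be sketched as follows, assuming Tcl 8.5 or
              later and an installed tcllib. The callback ::recordTag and the variable ::pending
              are hypothetical names chosen for illustration, and the variable is set to an empty
              string up front purely as a precaution. The sketch only shows where the incomplete
              tail ends up and what happens when -incvar is left out:

```tcl
package require htmlparse

# Hypothetical callback: remembers every tag event it is handed.
set seen {}
proc ::recordTag {tag slash param text} {
    lappend ::seen $slash$tag
}

# Global variable that receives the incomplete tail of the input.
set ::pending ""

# This chunk ends in the middle of the <b> tag.
::htmlparse::parse -cmd ::recordTag -incvar ::pending "<p>Hello <b"
puts "deferred: $::pending"

# Without -incvar the same input raises an error instead.
set failed [catch {::htmlparse::parse -cmd ::recordTag "<p>Hello <b"} msg]
puts "error raised: $failed ($msg)"
```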

              Another  feature  of the parser is its two possible modes of operation. The normal
              mode is activated if the option -queue is not present on the command line  invoking
              the parser. If it is present, the parser will go into the incremental mode instead.

              The  main  difference  is  that a parser in normal mode will immediately invoke the
              command prefix for each tag it encounters. In incremental mode however  the  parser
              will  generate  a  number  of scripts which invoke the command prefix for groups of
              tags in the HTML string and then store these scripts in the specified queue. It  is
              then  the responsibility of the caller of the parser to ensure the execution of the
              scripts in the queue.

              Note: The queue object given to the parser has to provide the same interface as the
              queue defined in tcllib -> struct. This means, for example, that all queues created
              via that tcllib module can be immediately used here. Still, the queue doesn't  have
              to come from tcllib -> struct as long as the same interface is provided.

              In both modes the parser will return an empty string to the caller.

              The  -split option may be given to a parser in incremental mode to specify the size
              of the groups it creates. In other words, -split 5 means that each of the generated
              scripts will invoke the command prefix for 5 consecutive tags in the HTML string. A
              parser in normal mode will ignore this option and its value.
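
              A rough sketch of incremental mode, assuming Tcl 8.5 or later and that struct::queue
              from tcllib supplies the queue. The names ::renderTag, ::scripts and the clientdata
              value .doc are hypothetical. The @win@ placeholder that the sketch replaces is
              described under "Interface to the command prefix". Draining the queue with a simple
              loop is only one possible way to run the generated scripts:

```tcl
package require struct::queue
package require htmlparse

# Hypothetical callback for incremental mode: clientdata comes first.
set calls {}
proc ::renderTag {win tag slash param text} {
    lappend ::calls [list $win $slash$tag]
}

# Queue that receives the generated scripts.
::struct::queue ::scripts

# Incremental mode: the scripts are queued, grouped according to -split.
::htmlparse::parse -cmd ::renderTag -queue ::scripts -split 2 \
    "<ul><li>one</li><li>two</li></ul>"

# The caller is responsible for running the queued scripts. Here the
# @win@ placeholder is replaced by a made-up clientdata value first.
while {[::scripts size] > 0} {
    set script [string map [list @win@ .doc] [::scripts get]]
    uplevel #0 $script
}
puts $calls
```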

              The option -vroot specifies a virtual root tag. A parser in normal mode will invoke
              the command prefix for it immediately before and after it processes the tags in the
              HTML, thus simulating that the HTML  string  is  enclosed  in  a  <vroot>  </vroot>
              combination.  In  incremental  mode  however  the  parser  is unable to provide the
              closing virtual root as it never knows when the input is complete. In this case the
              first  script generated by each invocation of the parser will contain an invocation
              of the command prefix for the virtual root as its first command.

              The following options are available:

              -cmd cmd
                     The  command  prefix to invoke for every tag in the HTML string. Defaults to
                     ::htmlparse::debugCallback.

              -vroot tag
                     The virtual root tag to add around the HTML in normal mode.  In  incremental
                     mode  it  is  the first tag in each chunk processed by the parser, but there
                     will be no closing tags. Defaults to hmstart.

              -split n
                     The size of the groups produced by an incremental mode parser. Ignored  when
                     in normal mode. Defaults to 10. Values <= 0 are not allowed.

              -incvar var
                     The name of the variable in which to store any incomplete HTML.  This  makes
                     most  sense  for the incremental mode. The parser will throw an error if it
                     sees incomplete HTML and has no place to store it, which  is  the  sensible
                     behavior  for  the normal mode. Only incomplete tags are detected, not missing
                     tags. Optional, defaults to 'no variable'.

              Interface to the command prefix
                     In normal mode the parser will invoke the command prefix with four arguments
                     appended. See ::htmlparse::debugCallback for a description.

                     In incremental mode, however, the generated scripts will invoke the  command
                     prefix  with  five  arguments  appended. The last four of these are the same
                     which were mentioned above. The first is a placeholder string (@win@) for  a
                     clientdata  value  to  be  supplied later during the actual execution of the
                     generated scripts. This could be a Tk window path, for example. This  allows
                     the  user of this package to preprocess HTML strings without committing them
                     to a specific window, object, or other target during parsing. This connection can
                     be  made  later.  This  also means that it is possible to cache preprocessed
                     HTML. Of course, nothing prevents the user of the parser from replacing  the
                     placeholder with an empty string.
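
              A minimal sketch of a parser in normal mode, assuming an installed tcllib. The
              callback ::collectTag and the virtual root name doc are hypothetical choices made for
              illustration. The callback takes the four arguments described for
              ::htmlparse::debugCallback and records each tag event in the order it is invoked:

```tcl
package require htmlparse

# Hypothetical callback: records each event as "<tag>" or "</tag>".
set events {}
proc ::collectTag {tag slash param text} {
    lappend ::events "<$slash$tag>"
}

# Normal mode (no -queue); "doc" replaces the default virtual root hmstart.
::htmlparse::parse -cmd ::collectTag -vroot doc "<p>Hello <b>world</b></p>"
puts $events
```

              Because of -vroot, the recorded events are expected to begin with the opening virtual
              root and end with its closing counterpart.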

       ::htmlparse::debugCallback ?clientdata? tag slash param textBehindTheTag
              This  command  is the standard callback used by the parser in ::htmlparse::parse if
              none was specified by the user. It simply dumps  its  arguments  to  stdout.   This
              callback can be used for both normal and incremental mode of the calling parser. In
              other words, it accepts four or five arguments. The last four arguments are always
              present.  The  optional  additional  argument, clientdata, comes first and contains
              the value passed to the callback by a parser in  incremental  mode.  All  callbacks
              have  to  follow  the signature of this command in the last four arguments, and
              callbacks used in incremental parsing have to follow it in all five arguments.

              The first argument, clientdata, is optional and present only  if  this  command  is
              invoked  by  a  parser  in  incremental mode. It contains whatever the user of this
              package wishes.

              The second argument, tag, contains the name of the tag which is currently processed
              by the parser.

              The third argument, slash, is either empty or contains a slash character. It allows
              the callback to distinguish between opening  (slash  is  empty)  and  closing  tags
              (slash contains a slash character).

              The  fourth  argument, param, contains the un-interpreted list of parameters to the
              tag.

              The fifth and last argument, textBehindTheTag,  contains  the  text  found  by  the
              parser behind the tag named in tag.
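
              A callback that works in both modes has to cope with either argument count. The
              following sketch, assuming Tcl 8.5 or later for lassign, shows one way to do that.
              The name ::describeTag and the returned list layout are hypothetical. The two calls
              at the end merely imitate what a parser would pass in each mode:

```tcl
# Hypothetical callback usable in both modes: it accepts four arguments
# (normal mode) or five (incremental mode, clientdata first).
proc ::describeTag {args} {
    if {[llength $args] == 5} {
        lassign $args clientdata tag slash param text
    } else {
        set clientdata ""
        lassign $args tag slash param text
    }
    # An empty slash marks an opening tag, a "/" marks a closing tag.
    set kind [expr {$slash eq "/" ? "close" : "open"}]
    return [list $clientdata $kind $tag $param $text]
}

# Direct calls imitating the two calling conventions.
set normal      [::describeTag a {} {href="x.html"} "link text"]
set incremental [::describeTag .doc a / {} ""]
puts $normal
puts $incremental
```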

       ::htmlparse::mapEscapes html
              This command takes an HTML  string,  substitutes  all  escape  sequences  with  their
              actual characters and then returns the resulting string.  HTML strings which do not
              contain escape sequences are returned unchanged.
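
              A short illustration, assuming an installed tcllib. It uses only the common named
              entities for "<", ">" and "&":

```tcl
package require htmlparse

set decoded [::htmlparse::mapEscapes "1 &lt; 2 &amp;&amp; 3 &gt; 2"]
puts $decoded   ;# prints: 1 < 2 && 3 > 2

# A string without escape sequences comes back unchanged.
set plain [::htmlparse::mapEscapes "no escapes here"]
puts $plain     ;# prints: no escapes here
```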

       ::htmlparse::2tree html tree
              This  command is a wrapper around ::htmlparse::parse which takes an HTML string (in
              html) and converts it into a tree containing the logical structure  of  the  parsed
              document.  The  name  of  the  tree  is given to the command as its second argument
              (tree). The command does not generate the tree  by  itself  but  expects  that  the
              caller  provided  it  with  an  existing  and  empty tree. It also expects that the
              specified tree object follows the same interface as the tree object  in  tcllib  ->
              struct.  It  doesn't have to be from tcllib -> struct, but it must provide the same
              interface.

              The internal callback does some basic  checking  of  HTML  validity  and  tries  to
              recover  from the most basic errors. The command returns the contents of its second
              argument. Side effects are the creation and manipulation of a tree object.

              Each node in the generated tree represents one tag in the input. The name of the tag
              is  stored  in  the attribute type of the node. Any html attributes coming with the
              tag are stored unmodified in the attribute data of the node. In  other  words,  the
              command does not parse html attributes into their names and values.

              If  a  tag contains text its node will have children of type PCDATA containing this
              text. The text will be stored in the attribute data of these children.
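
              A minimal sketch of building and inspecting such a tree, assuming the struct::tree
              package (version 2 interface) from tcllib. The tree name ::doc is a hypothetical
              choice. The walk collects the value of the type attribute of every node that has
              one:

```tcl
package require struct::tree
package require htmlparse

# 2tree expects an existing, empty tree.
::struct::tree ::doc
::htmlparse::2tree "<html><body><p>Hello</p></body></html>" ::doc

# Visit every node and collect the tag names stored under "type".
set types {}
::doc walk root node {
    if {[::doc keyexists $node type]} {
        lappend types [::doc get $node type]
    }
}
puts $types
```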

       ::htmlparse::removeVisualFluff tree
              This command walks a tree as generated by ::htmlparse::2tree and  removes  all  the
              nodes  which  represent  visual  tags  and  not structural ones. The purpose of the
              command is to make the tree easier to  navigate  without  getting  bogged  down  in
              visual information not relevant to the search. Its only argument is the name of the
              tree to cut down.

       ::htmlparse::removeFormDefs tree
              Like ::htmlparse::removeVisualFluff this command is here to cut down on the size of
              the  tree  as  generated  by  ::htmlparse::2tree. It removes all nodes representing
              forms and form elements. Its only argument is the name of the tree to cut down.
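
              Both reduction commands can be applied to the same tree, as in the following sketch.
              It assumes the struct::tree package (version 2 interface) from tcllib, and the tree
              name ::page is hypothetical. Which tags count as visual is decided by the package, so
              the sketch only compares the node counts before and after:

```tcl
package require struct::tree
package require htmlparse

::struct::tree ::page
::htmlparse::2tree \
    {<body><p>Some <b>bold</b> text</p><form><input type=text></form></body>} \
    ::page

set before [::page size]

# Both commands modify the named tree in place.
::htmlparse::removeVisualFluff ::page
::htmlparse::removeFormDefs    ::page

set after [::page size]
puts "nodes before: $before, after: $after"
```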

BUGS, IDEAS, FEEDBACK

       This document, and the package it describes,  will  undoubtedly  contain  bugs  and  other
       problems.    Please  report  such  in  the  category  htmlparse  of  the  Tcllib  Trackers
       [http://core.tcl.tk/tcllib/reportlist].  Please also report any ideas for enhancements you
       may have for either package and/or documentation.

SEE ALSO

       struct::tree

KEYWORDS

       html, parsing, queue, tree

CATEGORY

       Text processing