Ubuntu Manpage: HTML::TableParser - Extract data from an HTML table

Provided by: libhtml-tableparser-perl_0.40-1_all

NAME

       HTML::TableParser - Extract data from an HTML table

SYNOPSIS

         use HTML::TableParser;

         @reqs = (
                  {
                   id => 1.1,                    # id for embedded table
                   hdr => \&header,              # function callback
                   row => \&row,                 # function callback
                   start => \&start,             # function callback
                   end => \&end,                 # function callback
                   udata => { Snack => 'Food' }, # arbitrary user data
                  },
                  {
                   id => 1,                      # table id
                   cols => [ 'Object Type',
                             qr/object/ ],       # column name matches
                   obj => $obj,                  # method callbacks
                  },
                 );

         # create parser object
         $p = HTML::TableParser->new( \@reqs,
                          { Decode => 1, Trim => 1, Chomp => 1 } );
         $p->parse_file( 'foo.html' );

         # function callbacks
         sub start {
           my ( $id, $line, $udata ) = @_;
           #...
         }

         sub end {
           my ( $id, $line, $udata ) = @_;
           #...
         }

         sub header {
           my ( $id, $line, $cols, $udata ) = @_;
           #...
         }

         sub row  {
           my ( $id, $line, $cols, $udata ) = @_;
           #...
         }

DESCRIPTION

HTML::TableParser uses HTML::Parser to extract data from an HTML table. The data is
returned via a series of user defined callback functions or methods. Specific tables may
be selected either by a matching a unique table id or by matching against the column
names. Multiple (even nested) tables may be parsed in a document in one pass.

Table Identification
Each table is given a unique id, relative to its parent, based upon its order and nesting.
The first top level table has id 1, the second 2, etc. The first table nested in table 1
has id 1.1, the second 1.2, etc. The first table nested in table 1.1 has id 1.1.1, etc.
These, as well as the tables' column names, may be used to identify which tables to parse.

Data Extraction
As the parser traverses a selected table, it will pass data to user provided callback
functions or methods after it has digested particular structures in the table. All
functions are passed the table id (as described above), the line number in the HTML source
where the table was found, and a reference to any table specific user provided data.

Table Start
The start callback is invoked when a matched table has been found.

Table End
The end callback is invoked after a matched table has been parsed.

Header The hdr callback is invoked after the table header has been read in. Some tables
do not use the <th> tag to indicate a header, so this function may not be called.
It is passed the column names.

Row The row callback is invoked after a row in the table has been read. It is passed
the column data.

Warn The warn callback is invoked when a non-fatal error occurs during parsing. Fatal
errors croak.

New This is the class method to call to create a new object when HTML::TableParser is
supposed to create new objects upon table start.

Callback API
Callbacks may be functions or methods or a mixture of both. In the latter case, an object
must be passed to the constructor. (More on that later.)

The callbacks are invoked as follows:

start( $tbl_id, $line_no, $udata );

end( $tbl_id, $line_no, $udata );

hdr( $tbl_id, $line_no, \@col_names, $udata );

row( $tbl_id, $line_no, \@data, $udata );

warn( $tbl_id, $line_no, $message, $udata );

new( $tbl_id, $udata );

Data Cleanup
There are several cleanup operations that may be performed automatically:

Chomp chomp() the data

Decode Run the data through HTML::Entities::decode.

DecodeNBSP
Normally HTML::Entitites::decode changes a non-breaking space into a character
which doesn't seem to be matched by Perl's whitespace regexp. Setting this
attribute changes the HTML "nbsp" character to a plain 'ol blank.

Trim remove leading and trailing white space.

Data Organization
Column names are derived from cells delimited by the <th> and </th> tags. Some tables have
header cells which span one or more columns or rows to make things look nice.
HTML::TableParser determines the actual number of columns used and provides column names
for each column, repeating names for spanned columns and concatenating spanned rows and
columns. For example, if the table header looks like this:

+----+--------+----------+-------------+-------------------+
| | | Eq J2000 | | Velocity/Redshift |
| No | Object |----------| Object Type |-------------------|
| | | RA | Dec | | km/s | z | Qual |
+----+--------+----------+-------------+-------------------+

The columns will be:

No
Object
Eq J2000 RA
Eq J2000 Dec
Object Type
Velocity/Redshift km/s
Velocity/Redshift z
Velocity/Redshift Qual

Row data are derived from cells delimited by the <td> and </td> tags. Cells which span
more than one column or row are handled correctly, i.e. the values are duplicated in the
appropriate places.

METHODS

       new
                  $p = HTML::TableParser->new( \@reqs, \%attr );

               This is the class constructor.  It is passed a list of table requests as well as
               attributes which specify defaults for common operations.  Table requests are
               documented in "Table Requests".

               The %attr hash provides default values for some of the table request attributes,
               namely the data cleanup operations ( "Chomp", "Decode", "Trim" ), and the multi
               match attribute "MultiMatch", i.e.,

                 $p = HTML::TableParser->new( \@reqs, { Chomp => 1 } );

               will set Chomp on for all of the table requests, unless overridden by them.  The
               data cleanup operations are documented above; "MultiMatch" is documented in "Table
               Requests".

               Decode defaults to on; all of the others default to off.

       parse_file
               This is the same function as in HTML::Parser.

       parse   This is the same function as in HTML::Parser.

Table Requests

       A table request is a hash used by HTML::TableParser to determine which tables are to be
       parsed, the callbacks to be invoked, and any data cleanup.  There may be multiple requests
       processed by one call to the parser; each table is associated with a single request (even
       if several requests match the table).

       A single request may match several tables, however unless the MultiMatch attribute is
       specified for that request, it will be used for the first matching table only.

       A table request which matches a table id of "DEFAULT" will be used as a catch-all request,
       and will match all tables not matched by other requests.  Please note that tables are
       compared to the requests in the order that the latter are passed to the new() method;
       place the DEFAULT method last for proper behavior.

   Identifying tables to parse
       HTML::TableParser needs to be told which tables to parse.  This can be done by matching
       table ids or column names, or a combination of both.  The table request hash elements
       dedicated to this are:

       id      This indicates a match on table id.  It can take one of these forms:

               exact match
                         id => $match
                         id => '1.2'

                       Here $match is a scalar which is compared directly to the table id.

               regular expression
                         id => $re
                         id => qr/1\.\d+\.2/

                       $re is a regular expression, which must be constructed with the "qr//"
                       operator.

               subroutine
                         id => \&my_match_subroutine
                         id => sub { my ( $id, $oids ) = @_ ;
                                  $oids[0] > 3 && $oids[1] < 2 }

                       Here "id" is assigned a coderef to a subroutine which returns true if the
                       table matches, false if not.  The subroutine is passed two arguments: the
                       table id as a scalar string ( e.g. 1.2.3) and the table id as an arrayref
                       (e.g. "$oids = [ 1, 2, 3]").

               "id" may be passed an array containing any combination of the above:

                 id => [ '1.2', qr/1\.\d+\.2/, sub { ... } ]

               Elements in the array may be preceded by a modifier indicating the action to be
               taken if the table matches on that element.  The modifiers and their meanings are:

               "-"     If the id matches, it is explicitly excluded from being processed by this
                       request.

               "--"    If the id matches, it is skipped by all requests.

               "+"     If the id matches, it will be processed by this request.  This is the
                       default action.

               An example:

                 id => [ '-', '1.2', 'DEFAULT' ]

               indicates that this request should be used for all tables, except for table 1.2.

                 id => [ '--', '1.2' ]

               Table 2 is just plain skipped altogether.

       cols    This indicates a match on column names.  It can take one of these forms:

               exact match
                         cols => $match
                         cols => 'Snacks01'

                       Here $match is a scalar which is compared directly to the column names.
                       If any column matches, the table is processed.

               regular expression
                         cols => $re
                         cols => qr/Snacks\d+/

                       $re is a regular expression, which must be constructed with the "qr//"
                       operator.  Again, a successful match against any column name causes the
                       table to be processed.

               subroutine
                         cols => \&my_match_subroutine
                         cols => sub { my ( $id, $oids, $cols ) = @_ ;
                                       ... }

                       Here "cols" is assigned a coderef to a subroutine which returns true if
                       the table matches, false if not.  The subroutine is passed three
                       arguments: the table id as a scalar string ( e.g. 1.2.3), the table id as
                       an arrayref (e.g. "$oids = [ 1, 2, 3]"), and the column names, as an
                       arrayref (e.g. "$cols = [ 'col1', 'col2' ]").  This option gives the
                       calling routine the ability to make arbitrary selections based upon table
                       id and columns.

               "cols" may be passed an arrayref containing any combination of the above:

                 cols => [ 'Snacks01', qr/Snacks\d+/, sub { ... } ]

               Elements in the array may be preceded by a modifier indicating the action to be
               taken if the table matches on that element.  They are the same as the table id
               modifiers mentioned above.

       colre   This is deprecated, and is present for backwards compatibility only.  An arrayref
               containing the regular expressions to match, or a scalar containing a single
               reqular expression

       More than one of these may be used for a single table request. A request may match more
       than one table.  By default a request is used only once (even the "DEFAULT" id match!).
       Set the "MultiMatch" attribute to enable multiple matches per request.

       When attempting to match a table, the following steps are taken:

       1.      The table id is compared to the requests which contain an id match.  The first
               such match is used (in the order given in the passed array).

       2.      If no explicit id match is found, column name matches are attempted.  The first
               such match is used (in the order given in the passed array)

       3.      If no column name match is found (or there were none requested), the first request
               which matches an id of "DEFAULT" is used.

   Specifying the data callbacks
       Callback functions are specified with the callback attributes "start", "end", "hdr",
       "row", and "warn".  They should be set to code references, i.e.

         %table_req = ( ..., start => \&start_func, end => \&end_func )

       To use methods, specify the object with the "obj" key, and the method names via the
       callback attributes, which should be set to strings.  If you don't specify method names
       they will default to (you guessed it) "start", "end", "hdr", "row", and "warn".

         $obj = SomeClass->new();
         # ...
         %table_req_1 = ( ..., obj => $obj );
         %table_req_2 = ( ..., obj => $obj, start => 'start',
                                    end => 'end' );

       You can also have HTML::TableParser create a new object for you for each table by
       specifying the "class" attribute.  By default the constructor is assumed to be the class
       new() method; if not, specify it using the "new" attribute:

         use MyClass;
         %table_req = ( ..., class => 'MyClass', new => 'mynew' );

       To use a function instead of a method for a particular callback, set the callback
       attribute to a code reference:

         %table_req = ( ..., obj => $obj, end => \&end_func );

       You don't have to provide all the callbacks.  You should not use both "obj" and "class" in
       the same table request.

       HTML::TableParser automatically determines if your object or class has one of the required
       methods.  If you wish it not to use a particular method, set it equal to "undef".  For
       example

         %table_req = ( ..., obj => $obj, end => undef )

       indicates the object's end method should not be called, even if it exists.

       You can specify arbitrary data to be passed to the callback functions via the "udata"
       attribute:

         %table_req = ( ..., udata => \%hash_of_my_special_stuff )

   Specifying Data cleanup operations
       Data cleanup operations may be specified uniquely for each table. The available keys are
       "Chomp", "Decode", "Trim".  They should be set to a non-zero value if the operation is to
       be performed.

   Other Attributes
       The "MultiMatch" key is used when a request is capable of handling multiple tables in the
       document.  Ordinarily, a request will process a single table only (even "DEFAULT"
       requests).  Set it to a non-zero value to allow the request to handle more than one table.

LICENSE

       This software is released under the GNU General Public License.  You may find a copy at

          http://www.fsf.org/copyleft/gpl.html

AUTHOR

       Diab Jerius (djerius@cpan.org)