Ubuntu Manpage: Data::TableReader - Extract records from "dirty" tabular data sources

name
version
synopsis
description
attributes
methods
thanks
author
contributor
copyright and license

Provided by: libdata-tablereader-perl_0.011-2_all

NAME

       Data::TableReader - Extract records from "dirty" tabular data sources

VERSION

       version 0.011

SYNOPSIS

         # Find a row in the Excel file containing the headers
         #   "address", "city", "state", "zip" (in any order)
         # and then convert each row under that into a hashref of those fields.

         my $records= Data::TableReader>new(
             input => 'path/to/file.xlsx',
             fields => [qw( address city state zip )],
           )
           ->iterator->all;

       but there's plenty of options to choose from...

         my $tr= Data::TableReader->new(
           # path or file handle
           # let it auto-detect the format (but can override that if we need)
           input => 'path/to/file.csv',

           # We want these fields to exist in the file (identified by headers)
           fields => [
             { name => 'address', header => qr/street|address/i },
             'city',
             'state',
             # can validate with Type::Tiny classes
             { name => 'zip', header => qr/zip\b|postal/i, type => US_Zipcode },
           ],

           # Our data provider is horrible; just ignore any nonsense we encounter
           on_blank_row => 'next',
           on_validation_fail => 'next',

           # Capture warnings and show to user who uploaded file
           log => \(my @messages)
         );

         my $records= $tr->iterator->all;
         ...
         $http_response->body( encode_json({ messages => \@messages }) );

DESCRIPTION

       This module is designed to take "loose" or "dirty" tabular data sources (such as Excel,
       CSV, TSV, or HTML) which may have been edited by non-technical humans and extract the data
       into sanitized records, while also verifying that the data file contains roughly the
       schema you were expecting.  It is primarily intended for making automated imports of data
       from non-automated or unstable sources, and providing human-readable feedback about the
       validity of the data file.

ATTRIBUTES

   input
       This can be a file name or Path::Class instance or file handle or a
       Spreadsheet::ParseExcel::Worksheet object.  If a file handle, it must be seekable in order
       to auto-detect the file format, or you may specify the decoder directly to avoid auto-
       detection.

   decoder
       This is either an instance of Data::TableReader::Decoder, or a class name, or a partial
       class name to be appended as "Data::TableReader::Decoder::$name" or an arrayref or hashref
       of arguments to build the decoder.

       In an arrayref the first argument can be undef, and in a hashref the CLASS argument can be
       missing or undef. In those cases it will be detected from the input attribute and any
       default arguments combined with (and if necessary trumped by) the extra arguments in the
       arrayref or hashref.

       Examples:

         'CSV'
         # becomes Data::TableReader::Decoder::CSV->new()

         [ 'CSV', sep_char => "|" ]
         # becomes Data::TableReader::Decoder::CSV->new(sep_char => "|")

         { CLASS => 'CSV', sep_char => "|" }
         # becomes Data::TableReader::Decoder::CSV->new({ sep_char => "|" })

   fields
       An arrayref of Data::TableReader::Field objects which this module should search for within
       the tables (worksheets etc.) of "input".

       If an element of this array is a hashref or string, it will be coerced to an instance of
       Data::TableReader::Field, with plain strings becoming the "name" attribute.  See "header"
       in Data::TableReader::Field for how names are automatically converted to the header-
       matching regex.

   record_class
       Default is the special value 'HASH' for un-blessed hashref records.  The special value
       'ARRAY' will result in arrayrefs with fields in the same order they were specified in the
       "fields" specification.  Setting it to anything else will return records created with
       "$record_class->new(\%fields);"

   static_field_order
       Boolean, whether the "fields" must be found in columns in the exact order that they were
       specified.  Default is false.

   header_row_at
       Row number, or range of row numbers where the header must be found.  (All row numbers in
       this module are 1-based, to match end-user expectations.)  The default is "[1,10]" to
       limit header scanning to the first 10 rows.  As a special case, if you are reading a
       source which lacks headers and you trust the source to deliver the columns in the right
       order, you can set this to undef if you also set "static_field_order => 1".

   on_unknown_columns
         on_unknown_columns => 'use'  # warn, and then use the table
         on_unknown_columns => 'next' # warn, and then look for another table which matches
         on_unknown_columns => 'die'  # fatal error
         on_unknown_columns => sub {
           my ($reader, $col_headers)= @_;
           ...;
           return $opt; # one of the above values
         }

       This determines handling for columns that aren't associated with any field.  The
       "required" columns must all be found before it considers this setting, but once it has
       found everything it needs to make this a candidate, you might or might not care about the
       leftover columns.

       'use'  (default)
           You don't care if there are extra columns, just log warnings about them and proceed
           extracting from this table.

       'next'
           Extra columns mean that you didn't find the table you wanted.  Log the near-miss, and
           keep searching additional rows or additional tables.

       'die'
           This header is probably what you want, but you consider extra columns to be an error
           condition.  Logs the details and calls "croak".

       "sub {}"
           You can add your own logic to handle this.  Inspect the headers however you like, and
           then return one of the above values.

   on_blank_rows
         on_blank_rows => 'next' # warn, and then skip the row(s)
         on_blank_rows => 'last' # warn, and stop iterating the table
         on_blank_rows => 'die'  # fatal error
         on_blank_rows => 'use'  # actually try to return the blank rows as records
         on_blank_rows => sub {
           my ($reader, $first_blank_rownum, $last_blank_rownum)= @_;
           ...;
           return $opt; # one of the above values
         }

       This determines what happens when you've found the table, are extracting records, and
       encounter a series of blank rows (defined as a row with no printable characters in any
       field) followed by non-blank rows.  If you use the callback, it suppresses the default
       warning, since you can generate your own.

       The default is 'next'.

   on_validation_fail
         on_validation_fail => 'next'  # warn, and then skip the record
         on_validation_fail => 'use'   # warn, and then use the record anyway
         on_validation_fail => 'die'   # fatal error
         on_validation_fail => sub {
           my ($reader, $failures, $values, $context)= @_;
           for (@$failures) {
             my ($field, $value_index, $message)= @$_;
             ...
             # $field is a Data::TableReader::Field
             # $values->[$value_index] is the string that failed validation
             # $message is the error returned from the validation function
             # $context is a string describing the source of the row, like "Row 5"
             # You may modify $values to alter the record that is about to be created
           }
           # Clear the failures array to suppress warnings, if you actually corrected
           # the validation problems.
           @$failures= () if $opt eq 'use';
           # return one of the above constants to tell the iterator what to do next
           return $opt;
         }

       This determines what happens when you've found the table, are extracting records, and one
       row fails its validation.  In addition to deciding an option, the callback gives you a
       chance to alter the record before 'use'ing it.  If you use the callback, it suppresses the
       default warning, since you can generate your own.

       The default is 'die'.

   log
       If undefined (the default) all log messages above 'info' will be emitted with "warn
       "$message\n"".  If set to an object, it should support an API of:

         trace,  is_trace
         debug,  is_debug
         info,   is_info
         warn,   is_warn
         error,  is_error

       such as Log::Any and may other perl logging modules use.  You can also set it to a coderef
       such as:

         my @messages;
         sub { my ($level, $message)= @_;
           push @messages, [ $level, $message ]
             if grep { $level eq $_ } qw( info warn error );
         };

       for a simple way to capture the messages without involving a logging module.  And for
       extra convenience, you can set it to an arrayref which will receive any message that would
       otherwise have gone to 'warn' or 'error'.

METHODS

detect_input_format
my ($class, @args)= $tr->detect_input_format( $filename, $head_of_file );

This is used internally to detect the format of a file, but you can call it manually if
you like. The first argument (optional) is a file name, and the second argument (also
optional) is the first few hundred bytes of the file. Missing arguments will be pulled
from "input" if possible. The return value is the best guess of module name and
constructor arguments that should be used to parse the file. However, this doesn't
guarantee such module actually exists or is installed; it might just echo the file
extension back to you.

find_table
if ($tr->find_table) { ... }

Search through the input for the beginning of the records, identified by a header row
matching the various constraints defined in "fields". If "header_row_at" is undef, then
this does nothing and assumes success.

Returns a boolean of whether it succeeded. This method does not "croak" on failure like
"iterator" does, on the assumption that you want to handle them gracefully. All
diagnostics about the search are logged via "log".

col_map
This is a lazy attribute from table detection. After calling "find_table" you can inspect
which fields were found for each column via this method. If called before "find_table",
this triggers table detection and throws an exception if one isn't found.

Returns an arrayref with one element for each column, each undefined or a reference to the
Field object it matched.

field_map
This is another lazy attribute from table detection, mapping from field name to column
index/indicies which the field will be loaded from. If called before "find_table", this
triggers table detection and throws an exception if one isn't found.

Returns a hashref where key is the field name, and value is either a single column index,
or an arrayref of column indicies if the field is an array field.

iterator
my $iter= $tr->iterator;
while (my $rec= $iter->()) { ... }

Create an iterator. If the table has not been located, then find it and "croak" if it
can't be found. Depending on the decoder and input filehandle, you might only be able to
have one instance of the iterator at a time.

The iterator derives from Data::TableReader::Iterator but also has a method "all" which
returns all records in an arrayref.

my $records= $tr->iterator->all;

THANKS

       Portions of this software were funded by Ellis, Partners in Management Solutions
       <http://www.epmsonline.com/>.

AUTHOR

       Michael Conrad <mike@nrdvana.net>

CONTRIBUTOR

       Christian Walde <walde.christian@gmail.com>

COPYRIGHT AND LICENSE

       This software is copyright (c) 2019 by Michael Conrad.

       This is free software; you can redistribute it and/or modify it under the same terms as
       the Perl 5 programming language system itself.