Ubuntu Manpage: Spreadsheet::Read - Read the data from a spreadsheet

Provided by: libspreadsheet-read-perl_0.63-1_all

NAME

        Spreadsheet::Read - Read the data from a spreadsheet

SYNOPSIS

        use Spreadsheet::Read;
        my $book  = ReadData ("test.csv", sep => ";");
        my $book  = ReadData ("test.sxc");
        my $book  = ReadData ("test.ods");
        my $book  = ReadData ("test.xls");
        my $book  = ReadData ("test.xlsx");
        my $book  = ReadData ($fh, parser => "xls");

        my $sheet = $book->[1];             # first datasheet
        my $cell  = $book->[1]{A3};         # content of field A3 of sheet 1
        my $cell  = $book->[1]{cell}[1][3]; # same, unformatted

DESCRIPTION

       Spreadsheet::Read tries to transparently read *any* spreadsheet and return its content in a universal
       manner independent of the parsing module that does the actual spreadsheet scanning.

       For OpenOffice and/or LibreOffice this module uses Spreadsheet::ReadSXC
       <http://metacpan.org/release/Spreadsheet-ReadSXC>.

       For Microsoft Excel this module uses Spreadsheet::ParseExcel
       <http://metacpan.org/release/Spreadsheet-ParseExcel>, Spreadsheet::ParseXLSX
       <http://metacpan.org/release/Spreadsheet-ParseXLSX>, or Spreadsheet::XLSX
       <http://metacpan.org/release/Spreadsheet-XLSX> (discouraged).

       For CSV this module uses Text::CSV_XS <http://metacpan.org/release/Text-CSV_XS> or Text::CSV_PP
       <http://metacpan.org/release/Text-CSV_PP>.

       For SquirrelCalc there is a very simplistic built-in parser.

   Data structure
       The data is returned as an array reference:

         $book = [
             # Entry 0 is the overall control hash
             { sheets  => 2,
               sheet   => {
                 "Sheet 1"  => 1,
                 "Sheet 2"  => 2,
                 },
               type    => "xls",
               parser  => "Spreadsheet::ParseExcel",
               version => 0.59,
               error   => undef,
               },
             # Entry 1 is the first sheet
             { label   => "Sheet 1",
               maxrow  => 5,
               maxcol  => 2,
               cell    => [ undef,
                 [ undef, 1 ],
                 [ undef, undef, undef, undef, undef, "Nugget" ],
                 ],
               attr    => [],
               merged  => [],
               A1      => 1,
               B5      => "Nugget",
               },
             # Entry 2 is the second sheet
             { label   => "Sheet 2",
               :
               :

       To stay close to what spreadsheet users expect, row 1 and column 1 also have index 1 in the "cell"
       element of the sheet hash, so cell "A1" is the same as "cell" [1, 1] (column first). To convert between
       the two notations, there are two helper functions available: "cell2cr ()" and "cr2cell ()".

       The "cell" hash entry contains unformatted data, while the hash entries with the traditional labels
       contain the formatted values (if applicable).

       The control hash (the first entry in the returned array ref) contains some spreadsheet meta-data. The
       entry "sheet" makes it possible to find sheets when accessing them by name:

         my %sheet2 = %{$book->[$book->[0]{sheet}{"Sheet 2"}]};
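       The lookup above can be sketched without any spreadsheet file. The example below is only an
       illustration: it builds a small stand-in for the documented structure by hand (the values are made
       up), and "sheet_by_name ()" is a hypothetical helper, not a function exported by Spreadsheet::Read.

```perl
use strict;
use warnings;

# Hand-built stand-in for the structure ReadData returns (values are made up).
my $book = [
    # Entry 0: control hash
    { sheets => 2,
      sheet  => { "Sheet 1" => 1, "Sheet 2" => 2 },
      type   => "xls",
      },
    # Entry 1: first sheet
    { label  => "Sheet 1", maxrow => 5, maxcol => 2,
      cell   => [ undef,
        [ undef, 1 ],
        [ undef, undef, undef, undef, undef, "Nugget" ],
        ],
      A1     => 1,
      B5     => "Nugget",
      },
    # Entry 2: second sheet
    { label  => "Sheet 2", maxrow => 1, maxcol => 1,
      cell   => [ undef, [ undef, "x" ] ],
      A1     => "x",
      },
    ];

# Hypothetical helper: find a sheet by name through the control hash.
# Returns undef (in scalar context) when the name is unknown.
sub sheet_by_name {
    my ($book, $name) = @_;
    my $idx = $book->[0]{sheet}{$name} or return;
    return $book->[$idx];
    }

my $sheet = sheet_by_name ($book, "Sheet 1");
print "label:          $sheet->{label}\n";
print "B5 formatted:   $sheet->{B5}\n";
print "B5 unformatted: $sheet->{cell}[2][5]\n";   # column 2, row 5
```

       Returning a reference instead of copying the hash, as the one-liner above does, avoids duplicating a
       large sheet.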

   Functions
       ReadData

        my $book = ReadData ($source [, option => value [, ... ]]);

        my $book = ReadData ("file.csv", sep => ',', quote => '"');

        my $book = ReadData ("file.xls", dtfmt => "yyyy-mm-dd");

        my $book = ReadData ("file.ods");

        my $book = ReadData ("file.sxc");

        my $book = ReadData ("content.xml");

        my $book = ReadData ($content);

        my $book = ReadData ($fh, parser => "xls");

       Tries to convert the given file, string, or stream to the data structure described above.

       Processing Excel data from a stream or content is supported through a File::Temp
       <https://metacpan.org/release/File-Temp> temporary file or IO::Scalar
       <https://metacpan.org/release/IO-Scalar> when available.

       Spreadsheet::ReadSXC <https://metacpan.org/release/Spreadsheet-ReadSXC> does preserve sheet order as of
       version 0.20.

       Currently supported options are:

       parser
         Force the data to be parsed as a specific format. Possible values are "csv", "prl" (or "perl"), "sc"
         (or "squirelcalc"), "sxc" (or "oo", "ods", "openoffice", "libreoffice"), "xls" (or "excel"), and "xlsx"
         (or "excel2007").

         When parsing streams, instead of files, it is highly recommended to pass this option.

         Spreadsheet::Read supports several underlying parsers per spreadsheet type. It will try those from most
         favored to least favored. When you have a good reason to prefer a different parser, you can set that in
         environment variables. The other parsers for that type will then not be tried:

          env SPREADSHEET_READ_CSV=Text::CSV_PP ...

       cells
         Control the generation of named cells ("A1" etc). Default is true.

       rc
         Control the generation of the {cell}[c][r] entries. Default is true.

       attr
         Control the generation of the {attr}[c][r] entries. Default is false.  See "Cell Attributes" below.

       clip
         If  set,  "ReadData" will remove all trailing rows and columns per sheet that have no visual data. If a
         sheet has no data at all, the sheet will be skipped entirely when this attribute is true.

         This option is only valid if "cells" is true. The default value is true if "cells" is true,  and  false
         otherwise.

       strip
         If set, "ReadData" will remove leading and/or trailing whitespace from every field.

           strip  leading  trailing
           -----  -------  --------
             0      n/a      n/a
             1     strip     n/a
             2      n/a     strip
             3     strip    strip
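         The table reads as a two-bit mask: 1 selects leading whitespace, 2 selects trailing whitespace, and
         3 selects both. The sketch below only illustrates that reading of the table; "strip_field ()" is a
         hypothetical helper and is not claimed to be the module's own implementation.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the table: bit 1 = leading, bit 2 = trailing.
sub strip_field {
    my ($value, $strip) = @_;
    defined $value or return $value;        # leave empty cells alone
    $strip & 1 and $value =~ s/^\s+//;      # remove leading whitespace
    $strip & 2 and $value =~ s/\s+$//;      # remove trailing whitespace
    return $value;
    }

# Show the effect of every documented value on the same field.
for my $strip (0 .. 3) {
    printf "strip => %d: [%s]\n", $strip, strip_field ("  Nugget  ", $strip);
    }
```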

       sep
         Set separator for CSV. Default is comma ",".

       quote
         Set quote character for CSV. Default is '"'.

       dtfmt
         Set the format for MS-Excel date fields that are set to use the default date format. The default format
         in Excel is "m-d-yy", which is neither year-2000 safe nor very useful. The default is now
         "yyyy-mm-dd", which is more ISO-like.

         Note that date formatting in MS-Excel is not reliable at all, as it will store/replace/change the date
         field separator in already stored formats if you change your locale settings. So the above-mentioned
         default can be either "m-d-yy" OR "m/d/yy" depending on what that specific character happened to be
         at the time the user saved the file.

       debug
         Enable some diagnostic messages to STDERR.

         The value determines how much diagnostic output is dumped (using Data::Peek
         <https://metacpan.org/release/Data-Peek>). A value of 9 or higher will dump the entire structure from
         the back-end parser.

       All other attributes/options will be passed to the underlying parser if that parser supports attributes.

       cr2cell

        my $cell = cr2cell (col, row)

       "cr2cell ()" converts a "(column, row)" pair (1 based) to the traditional cell notation:

         my $cell = cr2cell ( 4, 14); # $cell now "D14"
         my $cell = cr2cell (28,  4); # $cell now "AB4"

       cell2cr

        my ($col, $row) = cell2cr ($cell)

       "cell2cr ()" converts traditional cell notation to a "(column, row)" pair (1 based):

         my ($col, $row) = cell2cr ("D14"); # returns ( 4, 14)
         my ($col, $row) = cell2cr ("AB4"); # returns (28,  4)
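       The column part of the traditional notation is a letter-based numbering without a zero digit ("A" is
       1, "Z" is 26, "AA" is 27). The sketch below is an illustrative re-implementation of that mapping, not
       the module's own code; the names "col2label ()", "my_cr2cell ()", and "my_cell2cr ()" are hypothetical
       and chosen so they do not clash with the real functions.

```perl
use strict;
use warnings;

# Convert a 1-based column number to its letter label (1 => "A", 28 => "AB").
sub col2label {
    my $col   = shift;
    my $label = "";
    while ($col > 0) {
        my $rem = ($col - 1) % 26;                  # 0 .. 25 for "A" .. "Z"
        $label  = chr (ord ("A") + $rem) . $label;  # prepend the next letter
        $col    = int (($col - 1) / 26);
        }
    return $label;
    }

# (column, row) => "D14"
sub my_cr2cell {
    my ($col, $row) = @_;
    return col2label ($col) . $row;
    }

# "D14" => (column, row); returns an empty list for malformed input
sub my_cell2cr {
    my $cell = shift;
    my ($letters, $row) = $cell =~ m/^([A-Za-z]+)([0-9]+)$/ or return;
    my $col = 0;
    $col = $col * 26 + ord ($_) - ord ("A") + 1 for split //, uc $letters;
    return ($col, $row + 0);
    }

print my_cr2cell (28, 4), "\n";
print join (", ", my_cell2cr ("D14")), "\n";
```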

       row

        my @row = row ($sheet, $row)

        my @row = Spreadsheet::Read::row ($book->[1], 3)

       Get full row of formatted values (like "$sheet->{A3} .. $sheet->{G3}")

       Note that the indexes in the returned list are 0-based.

       "row  ()"  is  not  imported  by default, so either specify it in the use argument list, or call it fully
       qualified.

       cellrow

        my @row = cellrow ($sheet, $row)

        my @row = Spreadsheet::Read::cellrow ($book->[1], 3)

       Get full row of unformatted values (like "$sheet->{cell}[1][3] .. $sheet->{cell}[7][3]")

       Note that the indexes in the returned list are 0-based.

       "cellrow ()" is not imported by default, so either specify it in the use argument list, or call it  fully
       qualified.

       rows

        my @rows = rows ($sheet)

        my @rows = Spreadsheet::Read::rows ($book->[1])

       Convert "{cell}"'s "[column][row]" to a "[row][column]" list.

       Note that the indexes in the returned list are 0-based, where the index in the "{cell}" entry is 1-based.

       "rows  ()"  is  not  imported by default, so either specify it in the use argument list, or call it fully
       qualified.
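       The index shift described above can be sketched on a hand-built sheet hash. This is only an
       illustration of the "[column][row]" to "[row][column]" conversion under the documented layout;
       "transpose_cells ()" is a hypothetical helper, and the real "rows ()" may treat empty cells
       differently.

```perl
use strict;
use warnings;

# Stand-in sheet hash shaped like the documented structure (values are made up).
my $sheet = {
    maxrow => 2,
    maxcol => 3,
    cell   => [ undef,
        [ undef, "a1", "a2"  ],     # column A, rows 1 and 2
        [ undef, "b1", "b2"  ],     # column B
        [ undef, "c1", undef ],     # column C, row 2 is empty
        ],
    };

# Hypothetical helper: turn 1-based {cell}[col][row] into a 0-based
# list of rows, each row being a reference to a list of column values.
sub transpose_cells {
    my $sheet = shift;
    my @rows;
    foreach my $r (1 .. $sheet->{maxrow}) {
        push @rows, [ map { $sheet->{cell}[$_][$r] } 1 .. $sheet->{maxcol} ];
        }
    return @rows;
    }

my @rows = transpose_cells ($sheet);

# $rows[0][1] is row 1, column B of the sheet
print join (",", map { defined $_ ? $_ : "" } @$_), "\n" for @rows;
```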

       parses

        parses ($format)

        Spreadsheet::Read::parses ("CSV")

       "parses ()" returns Spreadsheet::Read's capability to parse the required format. "ReadData" will pick its
       preferred parser for that format unless overruled. See "parser".

       "parses ()" is not imported by default, so either specify it in the use argument list, or call  it  fully
       qualified.

       Version

        my $v = Version ()

        my $v = Spreadsheet::Read::Version ()

        my $v = Spreadsheet::Read->VERSION;

       Returns the current version of Spreadsheet::Read.

       "Version  ()" is not imported by default, so either specify it in the use argument list, or call it fully
       qualified.

       This function returns exactly the same as "Spreadsheet::Read->VERSION"  returns  and  is  only  kept  for
       backward compatibility reasons.

   Using CSV
       In case of CSV parsing, "ReadData" will use the first line of the file to auto-detect the separation
       character if the first argument is a file and both "sep" and "quote" are not passed as attributes.
       Text::CSV_XS <https://metacpan.org/release/Text-CSV_XS> (or Text::CSV_PP
       <https://metacpan.org/release/Text-CSV_PP>) is able to automatically detect and use "\r" line endings.

       CSV can parse streams too, but be sure to pass "sep" and/or "quote" if these do not match the defaults
       "," and '"'.

       When an error is found in the CSV, it is automatically reported (to STDERR). The structure will store
       the error in "$ss->[0]{error}" as the anonymous list returned by "$csv->error_diag"
       <https://metacpan.org/pod/Text::CSV_XS#error_diag>. See Text::CSV_XS
       <https://metacpan.org/pod/Text-CSV_XS> for documentation.

        my $ss = ReadData ("bad.csv");
        $ss->[0]{error} and print $ss->[0]{error}[1], "\n";

   Cell Attributes
       If  the constructor was called with "attr" having a true value, effort is made to analyze and store field
       attributes like this:

           { label  => "Sheet 1",
             maxrow => 5,
             maxcol => 2,
             cell   => [ undef,
               [ undef, 1 ],
               [ undef, undef, undef, undef, undef, "Nugget" ],
               ],
             attr   => [ undef,
               [ undef, {
                 type    => "numeric",
                 fgcolor => "#ff0000",
                 bgcolor => undef,
                 font    => "Arial",
                 size    => undef,
                 format  => "## ##0.00",
                 halign  => "right",
                 valign  => "top",
                 uline   => 0,
                 bold    => 0,
                 italic  => 0,
                 wrap    => 0,
                 merged  => 0,
                 hidden  => 0,
                 locked  => 0,
                 enc     => "utf-8",
                 }, ],
               [ undef, undef, undef, undef, undef, {
                 type    => "text",
                 fgcolor => "#e2e2e2",
                 bgcolor => undef,
                 font    => "Letter Gothic",
                 size    => 15,
                 format  => undef,
                 halign  => "left",
                 valign  => "top",
                 uline   => 0,
                 bold    => 0,
                 italic  => 0,
                 wrap    => 0,
                 merged  => 0,
                 hidden  => 0,
                 locked  => 0,
                 enc     => "iso8859-1",
                 }, ],
               ],
             merged => [],
             A1     => 1,
             B5     => "Nugget",
             },

       This has now been partially implemented, mainly for Excel, as the other parsers do not (yet) support  all
       of that. YMMV.

       Merged cells

       Note  that  only Spreadsheet::ReadSXC <http://metacpan.org/release/Spreadsheet-ReadSXC> documents the use
       of merged cells, and not in a way useful for the spreadsheet consumer.

       CSV does not support merged cells (though future implementations of CSV for the web might).

       The documentation of merged areas in Spreadsheet::ParseExcel
       <http://metacpan.org/release/Spreadsheet-ParseExcel> and Spreadsheet::ParseXLSX
       <http://metacpan.org/release/Spreadsheet-ParseXLSX> can be found in Spreadsheet::ParseExcel::Worksheet
       <http://metacpan.org/release/Spreadsheet-ParseExcel-Worksheet> and Spreadsheet::ParseExcel::Cell
       <http://metacpan.org/release/Spreadsheet-ParseExcel-Cell>.

       None of the basic Spreadsheet::XLSX <http://metacpan.org/release/Spreadsheet-XLSX>,
       Spreadsheet::ParseExcel <http://metacpan.org/release/Spreadsheet-ParseExcel>, and
       Spreadsheet::ParseXLSX <http://metacpan.org/release/Spreadsheet-ParseXLSX> manual pages mention merged
       cells at all.

       This module just tries to return the information in a generic way.

       Given this spreadsheet as an example

        merged.xlsx:

            A     B     C
         +-----+-----------+
        1|     | foo       |
         +-----+           +
        2| bar |           |
         |     +-----+-----+
        3|     | urg | orc |
         +-----+-----+-----+

       the information extracted from that undocumented information is returned in the "merged" entry of the
       sheet's hash as a list of top-left, bottom-right coordinate pairs (col, row, col, row). For the given
       example, that would be:

        $ss->{merged} = [
           [ 1, 2, 1, 3 ], # A2-A3
           [ 2, 1, 3, 2 ], # B1-C2
           ];

       When the attributes are also enabled, there is some merge  information  copied  directly  from  the  cell
       information, but again, that stems from code analysis and not from documentation:

        my $ss = ReadData ("merged.xlsx", attr => 1)->[1];
        foreach my $row (1 .. $ss->{maxrow}) {
            foreach my $col (1 .. $ss->{maxcol}) {
                my $cell = cr2cell ($col, $row);
                printf "%s %-3s %d  ", $cell, $ss->{$cell},
                    $ss->{attr}[$col][$row]{merged};
                }
            print "\n";
            }

        A1     0  B1 foo 1  C1     1
        A2 bar 1  B2     1  C2     1
        A3     1  B3 urg 0  C3 orc 0

       In  this  example,  there  is  no  way  to see if "B2" is merged to "A2" or to "B1" without analyzing all
       surrounding cells. This could as well mean "A2:A3", "B1:C1", "B2:C2", as "A2:A3",  "B1:B2",  "C1:C2",  as
       "A2:A3",  "B1:C2".   Use  the  "merged"  entry described above to find out what fields are merged to what
       other fields.
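       A lookup in the "merged" list can be sketched as follows. The list is copied from the example above;
       "merged_area ()" is a hypothetical helper written for this illustration and is not part of
       Spreadsheet::Read.

```perl
use strict;
use warnings;

# The "merged" list from the example: [ col, row, col, row ] per area.
my @merged = (
    [ 1, 2, 1, 3 ],     # A2-A3
    [ 2, 1, 3, 2 ],     # B1-C2
    );

# Hypothetical helper: return the area that contains (col, row),
# or undef (in scalar context) when the cell is not part of a merged area.
sub merged_area {
    my ($merged, $col, $row) = @_;
    foreach my $m (@$merged) {
        my ($c1, $r1, $c2, $r2) = @$m;
        $col >= $c1 && $col <= $c2 && $row >= $r1 && $row <= $r2 and return $m;
        }
    return;
    }

# Check B2, A3, and B3
foreach my $cr ([ 2, 2 ], [ 1, 3 ], [ 2, 3 ]) {
    my $area = merged_area (\@merged, @$cr);
    printf "(%d, %d) %s\n", @$cr,
        $area ? "is in area [" . join (", ", @$area) . "]" : "is not merged";
    }
```

       With this lookup, "B2" resolves unambiguously to the "B1:C2" area, which the per-cell "merged" flag
       alone cannot tell.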

TOOLS

       This module comes with a few tools that perform tasks from the FAQ, like "How do I select only columns D
       through F from sheet 2 into a CSV file?"

       If the module was installed without the tools, you can find them here:
         https://github.com/Tux/Spreadsheet-Read/tree/master/examples

   "xlscat"
       Show (parts of) a spreadsheet in plain text, CSV, or HTML

        usage: xlscat   [-s <sep>] [-L] [-n] [-A] [-u] [Selection] file.xls
                        [-c | -m]                 [-u] [Selection] file.xls
                         -i                            [-S sheets] file.xls
           Generic options:
              -v[#]       Set verbose level (xlscat/xlsgrep)
              -d[#]       Set debug   level (Spreadsheet::Read)
              -u          Use unformatted values
              --noclip    Do not strip empty sheets and
                          trailing empty rows and columns
              -e <enc>    Set encoding for input and output
              -b <enc>    Set encoding for input
              -a <enc>    Set encoding for output
           Input CSV:
              --in-sep=c  Set input sep_char for CSV
           Input XLS:
              --dtfmt=fmt Specify the default date format to replace 'm-d-yy'
                          the default replacement is 'yyyy-mm-dd'
           Output Text (default):
              -s <sep>    Use separator <sep>. Default '|', \n allowed
              -L          Line up the columns
              -n [skip]   Number lines (prefix with column number)
                          optionally skip <skip> (header) lines
              -A          Show field attributes in ANSI escapes
              -h[#]       Show # header lines
           Output Index only:
              -i          Show sheet names and size only
           Output CSV:
              -c          Output CSV, separator = ','
              -m          Output CSV, separator = ';'
           Output HTML:
              -H          Output HTML
           Selection:
              -S <sheets> Only print sheets <sheets>. 'all' is a valid set
                          Default only prints the first sheet
              -R <rows>   Only print rows    <rows>. Default is 'all'
              -C <cols>   Only print columns <cols>. Default is 'all'
              -F <flds>   Only fields <flds> e.g. -FA3,B16
           Ordering (column numbers in result set *after* selection):
              --sort=spec Sort output (e.g. --sort=3,2r,5n,1rn+2)
                          +#   - first # lines do not sort (header)
                          #    - order on column # lexical ascending
                          #n   - order on column # numeric ascending
                          #r   - order on column # lexical descending
                          #rn  - order on column # numeric descending

   "xlsgrep"
       Show (parts of) a spreadsheet that match a pattern in plain text, CSV, or HTML

        usage: xlsgrep  [-s <sep>] [-L] [-n] [-A] [-u] [Selection] pattern file.xls
                        [-c | -m]                 [-u] [Selection] pattern file.xls
                         -i                            [-S sheets] pattern file.xls
           Generic options:
              -v[#]       Set verbose level (xlscat/xlsgrep)
              -d[#]       Set debug   level (Spreadsheet::Read)
              -u          Use unformatted values
              --noclip    Do not strip empty sheets and
                          trailing empty rows and columns
              -e <enc>    Set encoding for input and output
              -b <enc>    Set encoding for input
              -a <enc>    Set encoding for output
           Input CSV:
              --in-sep=c  Set input sep_char for CSV
           Input XLS:
              --dtfmt=fmt Specify the default date format to replace 'm-d-yy'
                          the default replacement is 'yyyy-mm-dd'
           Output Text (default):
              -s <sep>    Use separator <sep>. Default '|', \n allowed
              -L          Line up the columns
              -n [skip]   Number lines (prefix with column number)
                          optionally skip <skip> (header) lines
              -A          Show field attributes in ANSI escapes
              -h[#]       Show # header lines
           Grep options:
              -i          Ignore case
              -w          Match whole words only
           Output CSV:
              -c          Output CSV, separator = ','
              -m          Output CSV, separator = ';'
           Output HTML:
              -H          Output HTML
           Selection:
              -S <sheets> Only print sheets <sheets>. 'all' is a valid set
                          Default only prints the first sheet
              -R <rows>   Only print rows    <rows>. Default is 'all'
              -C <cols>   Only print columns <cols>. Default is 'all'
              -F <flds>   Only fields <flds> e.g. -FA3,B16
           Ordering (column numbers in result set *after* selection):
              --sort=spec Sort output (e.g. --sort=3,2r,5n,1rn+2)
                          +#   - first # lines do not sort (header)
                          #    - order on column # lexical ascending
                          #n   - order on column # numeric ascending
                          #r   - order on column # lexical descending
                          #rn  - order on column # numeric descending

   "xls2csv"
       Convert a spreadsheet to CSV. This is just a small wrapper over "xlscat".

        usage: xls2csv [ -o file.csv ] file.xls

   "ss2tk"
       Show a spreadsheet in a perl/Tk spreadsheet widget

        usage: ss2tk [-w <width>] [X11 options] file.xls [<pattern>]
               -w <width> use <width> as default column width (4)

   "ssdiff"
       Show the differences between two spreadsheets.

        usage: examples/ssdiff [--verbose[=1]] file.xls file.xlsx

TODO

       Options
           Module Options
             New  Spreadsheet::Read  options  are  bound  to happen. I'm thinking of an option that disables the
             reading of the data entirely to speed up an index request  (how  many  sheets/fields/columns).  See
             "xlscat -i".

           Parser options
             Try to transparently support as many options as the encapsulated modules support regarding
             (un)formatted values, (date) formats, hidden columns, rows, or fields etc. These could be
             implemented like "attr" above but named "meta", or just be new values in the "attr" hashes.

       Other spreadsheet formats
           I consider adding any spreadsheet interface that offers a usable API.

       Alternative parsers for existing formats
           As long as the alternative has a good reason for its existence, and the API of that parser reasonably
           fits my approach, I will consider implementing the glue layer, or applying patches to do so, as long
           as these match what CONTRIBUTING.md describes.

       Add an OO interface
           Consider  making  the ref an object, though I currently don't see the big advantage (yet). Maybe I'll
           make it so that it is a hybrid functional / OO interface.

AUTHOR

       H.Merijn Brand, <h.m.brand@xs4all.nl>

COPYRIGHT AND LICENSE

       Copyright (C) 2005-2015 H.Merijn Brand

       This library is free software; you can redistribute it and/or modify it under  the  same  terms  as  Perl
       itself.

perl v5.20.2                                       2015-10-10                                          Read(3pm)
