Ubuntu Manpage: XMLTV - Perl extension to read and write TV listings in XMLTV format

Provided by: libxmltv-perl_0.5.67-0.1_all

NAME

       XMLTV - Perl extension to read and write TV listings in XMLTV format

SYNOPSIS

         use XMLTV;
         my $data = XMLTV::parsefiles('tv.xml');
         my ($encoding, $credits, $ch, $progs) = @$data;
         my $langs = [ 'en', 'fr' ];
         print 'source of listings is: ', $credits->{'source-info-name'}, "\n"
             if defined $credits->{'source-info-name'};
         foreach (values %$ch) {
             my ($text, $lang) = @{XMLTV::best_name($langs, $_->{'display-name'})};
             print "channel $_->{id} has name $text\n";
             print "...in language $lang\n" if defined $lang;
         }
         foreach (@$progs) {
             print "programme on channel $_->{channel} at time $_->{start}\n";
             next if not defined $_->{desc};
             foreach (@{$_->{desc}}) {
                 my ($text, $lang) = @$_;
                 print "has description $text\n";
                 print "...in language $lang\n" if defined $lang;
             }
         }

       The value of $data will be something a bit like:

         [ 'UTF-8',
           { 'source-info-name' => 'Ananova', 'generator-info-name' => 'XMLTV' },
           { 'radio-4.bbc.co.uk' => { 'display-name' => [ [ 'BBC Radio 4', 'en'  ],
                                                          [ 'Radio 4',     'en'  ],
                                                          [ '4',           undef ] ],
                                      'id' => 'radio-4.bbc.co.uk' },
             ... },
           [ { start => '200111121800', title => [ [ 'Simpsons', 'en' ] ],
               channel => 'radio-4.bbc.co.uk' },
             ... ] ]

DESCRIPTION

       This module provides an interface to read and write files in XMLTV format (a TV listings
       format defined by xmltv.dtd).  In general element names in the XML correspond to hash keys
       in the Perl data structure.  You can think of this module as a bit like XML::Simple, but
       specialized to the XMLTV file format.

       The Perl data structure corresponding to an XMLTV file has four elements.  The first gives
       the character encoding used for text data, typically UTF-8 or ISO-8859-1.  (The encoding
       value could also be undef meaning 'unknown', when the library can't work out what it is.)
       The second element gives the attributes of the root <tv> element, which give information
       about the source of the TV listings.  The third element is a hash of channels keyed by
       channel id, each value being a hash corresponding to one <channel> element.  The fourth
       element is a list of programmes, each a hash corresponding to one <programme> element.
       More details about the data structure are given later.
       The easiest way to find out what it looks like is to load some small XMLTV files and use
       Data::Dumper to print out the resulting structure.
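
       As a rough sketch of that advice, the snippet below dumps a small hand-built structure of
       the same shape.  The literal data is an assumption standing in for what the parsing
       routines would return for a real file, so the example runs without any XMLTV input.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hand-built stand-in for the result of parsing a small XMLTV file.
my $data = [
    'UTF-8',                                            # encoding
    { 'source-info-name' => 'Ananova' },                # attributes of <tv>
    { 'radio-4.bbc.co.uk' => {                          # channels, keyed by id
          id             => 'radio-4.bbc.co.uk',
          'display-name' => [ [ 'BBC Radio 4', 'en' ] ],
      } },
    [ { channel => 'radio-4.bbc.co.uk',                 # programmes, in order
        start   => '200111121800',
        title   => [ [ 'Simpsons', 'en' ] ] } ],
];

my ($encoding, $credits, $channels, $programmes) = @$data;

# Sort hash keys so the dump is stable from run to run.
local $Data::Dumper::Sortkeys = 1;
local $Data::Dumper::Indent   = 1;
print Dumper($data);

printf "%d channel(s), %d programme(s), encoding %s\n",
    scalar(keys %$channels), scalar(@$programmes),
    defined $encoding ? $encoding : 'unknown';
```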

USAGE

       parse(document)
           Takes an XMLTV document (a string) and returns the Perl data structure.  It is assumed
           that the document is valid XMLTV; if not the routine may die() with an error (although
           the current implementation just warns and continues for most small errors).

           The first element of the listref returned, the encoding, may vary according to the
           encoding of the input document, the versions of perl and "XML::Parser" installed, the
           configuration of the XMLTV library and other factors including, but not limited to,
           the phase of the moon.  With luck it should always be either the encoding of the input
           file or UTF-8.

           Attributes and elements in the XML file whose names begin with 'x-' are skipped
           silently.  You can use these to include information which is not currently handled by
           the XMLTV format, or by this module.

       parsefiles(filename...)
           Like "parse()" but takes one or more filenames instead of a string document.  The data
           returned is the merging of those file contents: the programmes will be concatenated in
           their original order, the channels just put together in arbitrary order (ordering of
           channels should not matter).

           Every file must have the same character encoding; if not, an exception is thrown.
           Ideally the credits information would also be the same between all the
           files, since there is no obvious way to merge it - but if the credits information
           differs from one file to the next, one file is picked arbitrarily to provide credits
           and a warning is printed.  If two files give differing channel definitions for the
           same XMLTV channel id, then one is picked arbitrarily and a warning is printed.

           In the simple case, with just one file, you needn't worry about mismatching of
           encodings, credits or channels.

           The deprecated function "parsefile()" is a wrapper allowing just one filename.
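
           Because mismatched encodings raise an exception, a caller merging several files may
           want to trap it.  A minimal sketch: the helper name load_listings is hypothetical,
           and XMLTV is loaded at run time only so that the snippet compiles on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: merge several files, turning the exception raised
# for mismatched encodings into a message that names the files involved.
sub load_listings {
    my @files = @_;
    require XMLTV;    # loaded on first call
    my $data = eval { XMLTV::parsefiles(@files) };
    die "could not merge @files: $@" if not defined $data;
    return $data;     # [ encoding, credits, channels, programmes ]
}
```

           It would be called with whatever files are to be merged, for example
           load_listings('mon.xml', 'tue.xml'), where both file names are placeholders.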

       parse_callback(document, encoding_callback, credits_callback, channel_callback,
       programme_callback)
           An alternative interface.  Whereas "parse()" reads the whole document and then returns
           a finished data structure, with this routine you specify a subroutine to be called as
           each <channel> element is read and another for each <programme> element.

           The first argument is the document to parse.  The remaining arguments are code
           references, one for each part of the document.

           The callback for encoding will be called once with a string giving the encoding.  In
           present releases of this module, it is also possible for the value to be undefined
           meaning 'unknown', but it's hoped that future releases will always be able to figure
           out the encoding used.

           The callback for credits will be called once with a hash reference.  For channels and
           programmes, the appropriate function will be called zero or more times depending on
           how many channels / programmes are found in the file.

           The four subroutines will be called in order, that is, the encoding and credits will
           be done before the channel handler is called and all the channels will be dealt with
           before the first programme handler is called.

           If any of the code references is undef, nothing is called for that part of the file.

           For backwards compatibility, if the value for 'encoding callback' is not a code
           reference but a scalar reference, then the encoding found will be stored in that
           scalar.  Similarly if the 'credits callback' is a scalar reference, the scalar it
           points to will be set to point to the hash of credits.  This style of interface is
           deprecated: new code should just use four callbacks.

           For example:

               my $document = '<tv>...</tv>';

               my $encoding;
               sub encoding_cb( $ ) { $encoding = shift }

               my $credits;
               sub credits_cb( $ ) { $credits = shift }

               # The callback for each channel populates this hash.
               my %channels;
               sub channel_cb( $ ) {
                   my $c = shift;
                   $channels{$c->{id}} = $c;
               }

               # The callback for each programme.  We know that channels are
               # always read before programmes, so the %channels hash will be
               # fully populated.
               #
               sub programme_cb( $ ) {
                   my $p = shift;
                   print "got programme: $p->{title}->[0]->[0]\n";
                   my $c = $channels{$p->{channel}};
                   print 'channel name is: ', $c->{'display-name'}->[0]->[0], "\n";
               }

               # Let's go.
               XMLTV::parse_callback($document, \&encoding_cb, \&credits_cb,
                                     \&channel_cb, \&programme_cb);

       parsefiles_callback(encoding_callback, credits_callback, channel_callback,
       programme_callback, filenames...)
           As "parse_callback()" but takes one or more filenames to open, merging their contents
           in the same manner as "parsefiles()".  Note that the reading is still gradual - you
           get the channels and programmes one at a time, as they are read.

           Note that the same <channel> may be present in more than one file, so the channel
           callback will get called more than once.  It's your responsibility to weed out
           duplicate channel elements (since writing them out again requires that each have a
           unique id).
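
           One way to weed out the duplicates is to remember which ids have been seen.  The
           sketch below simulates the calls XMLTV would make; in real use \&channel_cb would be
           passed as the channel callback.  The sample channels are made up for illustration.

```perl
use strict;
use warnings;

my %seen;        # channel ids already passed on
my @channels;    # unique channels, in the order first seen

# Channel callback: keep only the first <channel> seen for each id.
sub channel_cb {
    my $c = shift;
    return if $seen{ $c->{id} }++;
    push @channels, $c;
}

# Simulate the callback being invoked for two files that both define
# the same channel.
channel_cb({ id => 'radio-4.bbc.co.uk', 'display-name' => [ [ 'BBC Radio 4', 'en' ] ] });
channel_cb({ id => 'test-channel',      'display-name' => [ [ 'Test', 'en' ] ] });
channel_cb({ id => 'radio-4.bbc.co.uk', 'display-name' => [ [ 'Radio 4', 'en' ] ] });

print "kept: $_->{id}\n" foreach @channels;
```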

           For compatibility, there is an alias "parsefile_callback()" which is the same but
           takes only a single filename, before the callback arguments.  This is deprecated.

       write_data(data, options...)
           Takes a data structure and writes it as XML to standard output.  Any extra arguments
           are passed on to XML::Writer's constructor, for example

               use IO::File;
               my $f = IO::File->new('out.xml', 'w') or die "cannot write out.xml: $!";
               XMLTV::write_data($data, OUTPUT => $f);

           The encoding used for the output is given by the first element of the data.

           Normally, there will be a warning for any Perl data which is not understood and cannot
           be written as XMLTV, such as strange keys in hashes.  But as an exception, any hash
           key beginning with an underscore will be skipped over silently.  You can store
           'internal use only' data this way.

           If a programme or channel hash contains a key beginning with 'debug', this key and its
           value will be written out as a comment inside the <programme> or <channel> element.
           This lets you include small debugging messages in the XML output.
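
           The two conventions can be sketched with a tiny data structure.  The key names
           '_fetched_from' and 'debug', their values, and the wrapper write_listings are all
           invented for illustration; how each key is treated is as described above.

```perl
use strict;
use warnings;

# A tiny listing.  '_fetched_from' is private bookkeeping (skipped silently
# on output); 'debug' should come out as a comment in the <programme>.
my $data = [
    'UTF-8',
    { 'generator-info-name' => 'example' },
    { 'test-channel' => { id             => 'test-channel',
                          'display-name' => [ [ 'Test', 'en' ] ] } },
    [ { channel       => 'test-channel',
        start         => '200203161500',
        title         => [ [ 'News', 'en' ] ],
        _fetched_from => 'cache',
        debug         => 'title guessed from schedule grid' } ],
];

# Hypothetical wrapper around write_data(); OUTPUT is handed to XML::Writer.
sub write_listings {
    my ($listings, $path) = @_;
    require XMLTV;
    require IO::File;
    my $fh = IO::File->new($path, 'w') or die "cannot write $path: $!";
    XMLTV::write_data($listings, OUTPUT => $fh);
    $fh->close;
}
```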

       best_name(languages, pairs [, comparator])
           The XMLTV format contains many places where human-readable text is given an optional
           'lang' attribute, to allow mixed languages.  This is represented in Perl as a pair [
           text, lang ], although the second element may be missing or undef if the language is
           unknown.  When several alternatives for an element (such as <title>) can be given, the
           representation is a list of [ text, lang ] pairs.  Given such a list, what is the best
           text to use?  It depends on the user's preferred language.

           This function takes a list of acceptable languages and a list of [string, language]
           pairs, and finds the best one to use.  This means first finding the appropriate
           language and then picking the 'best' string in that language.

           The best is normally defined as the first one found in a usable language, since the
           XMLTV format puts the most canonical versions first.  But you can pass in your own
           comparison function, for example if you want to choose the shortest piece of text that
           is in an acceptable language.

           The acceptable languages should be a reference to a list of language codes looking
           like 'ru', or like 'de_DE'.  The text pairs should be a reference to a list of pairs [
           string, language ].  (As a special case if this list is empty or undef, that means no
           text is present, and the result is undef.)  The third argument if present should be a
           cmp-style function that compares two strings of text and returns 1 if the first
           argument is better, -1 if the second better, 0 if they're equally good.

           Returns: [s, l] pair, where s is the best of the strings to use and l is its language.
           This pair is 'live' - it is one of those from the list passed in.  So you can use
           "best_name()" to find the best pair from a list and then modify the content of that
           pair.

           (This routine depends on the "Lingua::Preferred" module being installed; if that
           module is missing then the first available language is always chosen.)

           Example:

               my $langs = [ 'de', 'fr' ]; # German or French, please

               # Say we found the following under $p->{title} for a programme $p.
               my $pairs = [ [ 'La Cité des enfants perdus', 'fr' ],
                             [ 'The City of Lost Children', 'en_US' ] ];

               my $best = XMLTV::best_name($langs, $pairs);
               print "chose title $best->[0]\n";
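
           The optional comparator can be sketched as follows, preferring the shortest text.
           The name shorter_is_better is made up for illustration; only the calling convention
           (1 if the first argument is better, -1 if the second, 0 if equal) comes from the
           description above.

```perl
use strict;
use warnings;

# cmp-style comparator: the shorter string is the better one.
sub shorter_is_better {
    my ($first, $second) = @_;
    return length($second) <=> length($first);
}

print shorter_is_better('BBC R4', 'BBC Radio 4'), "\n";    # first is shorter
print shorter_is_better('BBC Radio 4', 'BBC R4'), "\n";    # second is shorter
```

           It would be passed by reference as the third argument, for example
           XMLTV::best_name($langs, $pairs, \&shorter_is_better).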

       list_channel_keys(), list_programme_keys()
           Some users of this module may wish to enquire at runtime about which keys a programme
           or channel hash can contain.  The data in the hash comes from the attributes and
           subelements of the corresponding element in the XML.  The values of attributes are
           simply stored as strings, while subelements are processed with a handler which may
           return a complex data structure.  These subroutines return a hash mapping key to
           handler name and multiplicity.  This lets you know what data types can be expected
           under each key.  For keys which come from attributes rather than subelements, the
           handler is set to 'scalar', just as for subelements which give a simple string.  See
           "DATA STRUCTURE" for details on what the different handler names mean.

           It is not possible to find out which keys are mandatory and which optional, only a
           list of all those which might possibly be present.  An example use of these routines
           is the tv_grep program, which creates its allowed command line arguments from the
           names of programme subelements.

       catfiles(w_args, filename...)
           Concatenate several listings files, writing the output to somewhere specified by
           "w_args".  Programmes are catenated together, channels are merged, for credits we just
           take the first and warn if the others differ.

           The first argument is a hash reference giving information to pass to "XMLTV::Writer"'s
           constructor.  But do not specify encoding, this will be taken from the input files.
           "catfiles()" will abort if the input files have different encodings, unless
           'UTF8' => 1 is passed in "w_args".

       cat(data, ...)
           Concatenate (and merge) listings data.  Programmes are catenated together, channels
           are merged, for credits we just take the first and warn if the others differ (except
           that the 'date' of the result is the latest date of all the inputs).

           Whereas "catfiles()" reads and writes files, this function takes already-parsed
           listings data and returns some more listings data.  It is much more memory-hungry.

       cat_noprogrammes
           Like "cat()" but ignores the programme data and just returns encoding, credits and
           channels.  This is in case for scalability reasons you want to handle programmes
           individually, but still merge the smaller data.

DATA STRUCTURE

       For completeness, we describe more precisely how channels and programmes are represented
       in Perl. Each value in the channels hash is a hashref corresponding to one <channel>
       element, and each element of the programmes list is likewise a hashref corresponding to
       one <programme> element. The possible keys of a channel (programme) hash are the names
       of attributes or subelements of <channel> (<programme>).

       The values for attributes are not processed in any way; an attribute fred="jim" in the
       XML will become a hash element with key 'fred', value 'jim'.

       But for subelements, there is further processing needed to turn the XML content of a
       subelement into Perl data. What is done depends on what type of data is stored under that
       subelement. Also, if a certain element can appear several times then the hash key for
       that element points to a list of values rather than just one.

       The conversion of a subelement's content to and from Perl data is done by a handler. The
       most common handler is with-lang, used for human-readable text content plus an optional
       'lang' attribute. There are other handlers for other data structures in the file format.
       Often two subelements will share the same handler, since they hold the same type of data.
       The handlers defined are as follows; note that many of them will silently strip leading
       and trailing whitespace in element content. Look at the DTD itself for an explanation of
       the whole file format.

       Unless specified otherwise, it is not allowed for an element expected to contain text to
       have empty content, nor for the text to contain newline characters.

       credits
           Turns a list of credits (for director, actor, writer, etc.) into a hash mapping 'role'
           to a list of names. The names in each role are kept in the same order.

       scalar
           Reads and writes a simple string as the content of the XML element.

       length
           Converts the content of a <length> element into a number of seconds (so <length
           units="minutes">5</length> would be returned as 300). On writing out again tries to
           convert a number of seconds to a time in minutes or hours if that would look better.

       episode-num
           The representation in Perl of XMLTV's odd episode numbers is as a pair of [ content,
           system ]. As specified by the DTD, if the system is not given in the file then
           'onscreen' is assumed. Whitespace in the 'xmltv_ns' system is unimportant, so on
           reading it is normalized to a single space on either side of each dot.

       video
           The <video> section is converted to a hash. The <present> subelement corresponds to
           the key 'present' of this hash, 'yes' and 'no' are converted to Booleans. The same
           applies to <colour>. The content of the <aspect> subelement is stored under the key
           'aspect'. These keys can be missing in the hash just as the subelements can be
           missing in the XML.

       audio
           This is similar to video. <present> is a Boolean value, while the content of <stereo>
           is stored unchanged.

       previously-shown
           The 'start' and 'channel' attributes are converted to keys in a hash.

       presence
           The content of the element is ignored: it signifies something by its very presence. So
           the conversion from XML to Perl is a constant true value whenever the element is
           found; the conversion from Perl to XML is to write out the element if true, don't
           write anything if false.

       subtitles
           The 'type' attribute and the 'language' subelement (both optional) become keys in a
           hash. But see language for what to pass as the value of that element.

       rating
           The rating is represented as a tuple of [ rating, system, icons ]. The last element
           is itself a listref of structures returned by the icon handler.

       star-rating
           In XML this is a string 'X/Y' plus a list of icons. In Perl represented as a pair [
           rating, icons ] similar to rating.

           Multiple star ratings are now supported. For backward compatibility, you may specify a
           single [rating,icon] or the preferred double array
           [[rating,system,icon],[rating2,system2,icon2]] (like 'ratings')

       icon
           An icon in XMLTV files is like the <img> element in HTML. It is represented in Perl
           as a hashref with 'src' and optionally 'width' and 'height' keys.

       with-lang
           In XML something like title can be either <title>Foo</title> or <title
           lang="en">Foo</title>. In Perl these are stored as [ 'Foo' ] and [ 'Foo', 'en' ].
           For the former [ 'Foo', undef ] would also be okay.

           This handler also has two modifiers which may be added to the name after '/'. /e
           means that empty text is allowed, and will be returned as the empty tuple [], to mean
           that the element is present but has no text. When writing with /e, undef will also be
           understood as present-but-empty. You cannot however specify a language if the text is
           empty.

           The modifier /m means that the text is allowed to span multiple lines.

           So for example with-lang/em is a handler for text with language, where the text may be
           empty and may contain newlines. Note that the with-lang-or-empty of earlier releases
           has been replaced by with-lang/e.
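
       To make these representations concrete, here is a sketch that walks a programme hash
       built by hand in the shapes described above. All of the values are sample data, not
       output from a real listings file.

```perl
use strict;
use warnings;

# A programme hash as the handlers above would present it (sample data).
my %prog = (
    channel       => 'test-channel',
    start         => '200203161500',
    title         => [ [ 'News', 'en' ], [ 'Nachrichten', 'de' ] ],   # with-lang
    credits       => { director => [ 'A. Director' ],
                       actor    => [ 'B. Actor', 'C. Actor' ] },      # credits
    length        => 1800,                                            # seconds
    'episode-num' => [ [ '1 . 4 . 0/2', 'xmltv_ns' ] ],               # [ content, system ]
    video         => { present => 1, colour => 1, aspect => '16:9' }, # video
    new           => 1,                                               # presence
);

# with-lang: each entry is [ text, lang ]; lang may be undef or absent.
foreach my $pair (@{ $prog{title} }) {
    my ($text, $lang) = @$pair;
    printf "title: %s (%s)\n", $text, defined $lang ? $lang : 'unknown language';
}

# credits: role => list of names, in their original order.
foreach my $role (sort keys %{ $prog{credits} }) {
    print "$role: ", join(', ', @{ $prog{credits}{$role} }), "\n";
}

# length: a number of seconds once parsed.
my $minutes = $prog{length} / 60;
print "runs for $minutes minutes\n";

# episode-num: the xmltv_ns content splits on the normalized ' . '.
my @xmltv_ns = map  { [ split / \. /, $_->[0] ] }
               grep { $_->[1] eq 'xmltv_ns' } @{ $prog{'episode-num'} };
print "episode-num fields: @{ $xmltv_ns[0] }\n" if @xmltv_ns;

print "widescreen\n"    if ($prog{video}{aspect} || '') eq '16:9';
print "new programme\n" if $prog{new};
```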

       Now, which handlers are used for which subelements (keys) of channels and programmes? And
       what is the multiplicity (should you expect a single value or a list of values)?

       The following tables map subelements of <channel> and of <programme> to the handlers used
       to read and write them. Many elements have their own handler with the same name, and most
       of the others use with-lang. The third column specifies the multiplicity of the element:
       * (any number) will give a list of values in Perl, + (one or more) will give a nonempty
       list, ? (maybe one) will give a scalar, and 1 (exactly one) will give a scalar which is
       not undef.

       Handlers for <channel>
           Subelement          Handler             Multiplicity
           display-name        with-lang           +
           icon                icon                *
           url                 scalar              *

       Handlers for <programme>
           Subelement          Handler             Multiplicity
           title               with-lang           +
           sub-title           with-lang           *
           desc                with-lang/m         *
           credits             credits             ?
           date                scalar              ?
           category            with-lang           *
           keyword             with-lang           *
           language            with-lang           ?
           orig-language       with-lang           ?
           length              length              ?
           icon                icon                *
           url                 scalar              *
           country             with-lang           *
           episode-num         episode-num         *
           video               video               ?
           audio               audio               ?
           previously-shown    previously-shown    ?
           premiere            with-lang/em        ?
           last-chance         with-lang/em        ?
           new                 presence            ?
           subtitles           subtitles           *
           rating              rating              *
           star-rating         star-rating         *
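
       Since the multiplicity decides whether a key holds one value or a list, code that
       treats keys generically has to look it up. A minimal sketch using a hand-copied subset
       of the <programme> table; %MULTIPLICITY and as_list are hypothetical names, and real
       code could consult list_programme_keys() rather than a literal table.

```perl
use strict;
use warnings;

# Subset of the <programme> table above: key => multiplicity.
my %MULTIPLICITY = (
    title    => '+',
    desc     => '*',
    date     => '?',
    category => '*',
    length   => '?',
);

# Return the values stored under $key as a list, whatever the multiplicity.
sub as_list {
    my ($prog, $key) = @_;
    my $mult = $MULTIPLICITY{$key};
    die "unknown key $key" if not defined $mult;
    return () if not defined $prog->{$key};
    return ($mult eq '*' || $mult eq '+') ? @{ $prog->{$key} } : ($prog->{$key});
}

my %prog = (
    title => [ [ 'News', 'en' ] ],
    date  => '2001',
);

foreach my $key (sort keys %MULTIPLICITY) {
    my @values = as_list(\%prog, $key);
    printf "%-8s %d value(s)\n", $key, scalar @values;
}
```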

       At present, no parsing or validation on dates is done because dates may be partially
       specified in XMLTV. For example '2001' means that the year is known but not the month,
       day or time of day. Maybe in the future dates will be automatically converted to and from
       Date::Manip objects. For now they just use the scalar handler. Similar remarks apply to
       URLs.
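
       Because dates arrive as plain strings, a caller that needs the individual fields must
       split them itself. A minimal sketch, assuming the digits run year, month, day, hour,
       minute, second with trailing fields omitted when unknown, optionally followed by a
       space and a time zone; split_date is a hypothetical helper, not part of this module.

```perl
use strict;
use warnings;

# Split a possibly partial XMLTV date into the fields that are present.
# Returns a hashref, or nothing if the string does not look like a date.
sub split_date {
    my $date = shift;
    my @fields = $date =~
        /^(\d{4})(\d\d)?(\d\d)?(\d\d)?(\d\d)?(\d\d)?(?:\s+(\S+))?$/
        or return;
    my %parts;
    @parts{qw(year month day hour minute second tz)} = @fields;
    # Drop the fields the string did not specify.
    delete $parts{$_} foreach grep { not defined $parts{$_} } keys %parts;
    return \%parts;
}

foreach my $date ('2001', '200111121800', '20020316150000 +0100') {
    my $parts = split_date($date) or next;
    print "$date: ", join(', ', map { "$_=$parts->{$_}" } sort keys %$parts), "\n";
}
```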

WRITING

       When reading a file you have the choice of using "parse()" to gulp the whole file and
       return a data structure, or using "parse_callback()" to get the programmes one at a time,
       although channels and other data are still read all at once.

       There is a similar choice when writing data: the "write_data()" routine prints a whole
       XMLTV document at once, but if you want to write an XMLTV document incrementally you can
       manually create an "XMLTV::Writer" object and call methods on it.  Synopsis:

         use XMLTV;
         my $w = new XMLTV::Writer();
         $w->comment("Hello from XML::Writer's comment() method");
         $w->start({ 'generator-info-name' => 'Example code in pod' });
         my %ch = (id => 'test-channel', 'display-name' => [ [ 'Test', 'en' ] ]);
         $w->write_channel(\%ch);
         my %prog = (channel => 'test-channel', start => '200203161500',
                     title => [ [ 'News', 'en' ] ]);
         $w->write_programme(\%prog);
         $w->end();

       XMLTV::Writer inherits from XML::Writer, and provides the following extra or overridden
       methods:

       new(), the constructor
           Creates an XMLTV::Writer object and starts writing an XMLTV file, printing the DOCTYPE
           line.  Arguments are passed on to XML::Writer's constructor, except for the following:

           the 'encoding' key if present gives the XML character encoding.  For example:

             my $w = new XMLTV::Writer(encoding => 'ISO-8859-1');

           If encoding is not specified, XML::Writer's default is used (currently UTF-8).

           XMLTV::Writer can also filter out specific days from the data. This is useful if the
           datasource provides data for periods of time that do not match the days that the
           user has asked for. The filtering is controlled with the days, offset and cutoff
           arguments:

             my $w = new XMLTV::Writer(
                 offset => 1,
                 days => 2,
                 cutoff => "050000" );

           In this example, XMLTV::Writer will discard all entries that do not have starttimes
           larger than or equal to 05:00 tomorrow and less than 05:00 two days after tomorrow.
           The time offset is stripped off the starttime before the comparison is made.
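
           The window described above can be worked out by hand as follows.  This is only an
           illustration of the arithmetic, not the module's own code; filter_window is a
           hypothetical helper and "today" is passed in explicitly so the result is repeatable.

```perl
use strict;
use warnings;
use POSIX qw(strftime);
use Time::Local qw(timegm);

# Given "today" as YYYYMMDD, return the first start time kept and the
# first start time rejected, both as YYYYMMDDhhmmss strings.
sub filter_window {
    my ($today, $offset, $days, $cutoff) = @_;
    my ($y, $m, $d) = $today =~ /^(\d{4})(\d\d)(\d\d)$/
        or die "bad date $today";
    # Work in UTC so that adding whole days is plain arithmetic.
    my $midnight = timegm(0, 0, 0, $d, $m - 1, $y);
    my $day      = 24 * 60 * 60;
    my $from = strftime('%Y%m%d', gmtime($midnight + $offset * $day)) . $cutoff;
    my $to   = strftime('%Y%m%d', gmtime($midnight + ($offset + $days) * $day)) . $cutoff;
    return ($from, $to);
}

my ($from, $to) = filter_window('20020316', 1, 2, '050000');
print "keep programmes with $from <= start < $to\n";
```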

       start()
           Write the start of the <tv> element.  Parameter is a hashref which gives the
           attributes of this element.

       write_channels()
           Write several channels at once.  Parameter is a reference to a hash mapping channel id
           to channel details.  They will be written sorted by id, which is reasonable since the
           order of channels in an XMLTV file isn't significant.

       write_channel()
           Write a single channel.  You can call this routine if you want, but most of the time
           "write_channels()" is a better interface.

       write_programme()
           Write details for a single programme as XML.

       end()
           Say you've finished writing programmes.  This ends the <tv> element and the file.
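
       Putting the incremental reader and writer together, a streaming copy might look like the
       sketch below.  The helper name copy_listings and its arguments are hypothetical; the
       calls themselves are the ones documented above, with XMLTV loaded at run time so the
       snippet compiles on its own.  It assumes the credits callback fires before any channel
       or programme, as described for "parse_callback()".

```perl
use strict;
use warnings;

# Hypothetical helper: copy listings from @files to $out_path one
# programme at a time, dropping repeated channel definitions.
sub copy_listings {
    my ($out_path, @files) = @_;
    require XMLTV;
    require IO::File;

    my $fh = IO::File->new($out_path, 'w') or die "cannot write $out_path: $!";
    my ($encoding, $writer, %seen);

    XMLTV::parsefiles_callback(
        sub { $encoding = shift },                   # encoding, seen first
        sub {                                        # credits: start the output
            my %args = (OUTPUT => $fh);
            $args{encoding} = $encoding if defined $encoding;
            $writer = XMLTV::Writer->new(%args);
            $writer->start(shift);
        },
        sub {                                        # each channel
            my $c = shift;
            $writer->write_channel($c) unless $seen{ $c->{id} }++;
        },
        sub { $writer->write_programme(shift) },     # each programme
        @files,
    );
    $writer->end();
    $fh->close;
}
```

       Only one programme is held in memory at a time, which is the point of using the
       callback and writer interfaces instead of "parsefiles()" and "write_data()".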

AUTHOR

       Ed Avis, ed@membled.com