Ubuntu Manpage: RDF::RDFa::Parser - flexible RDFa parser

Provided by: librdf-rdfa-parser-perl_1.097-1_all

NAME

       RDF::RDFa::Parser - flexible RDFa parser

SYNOPSIS

       To work with an RDF::Trine::Model that can be queried with SPARQL, etc:

        use RDF::RDFa::Parser;
        my $url     = 'http://example.com/document.html';
        my $options = RDF::RDFa::Parser::Config->new('xhtml', '1.1');
        my $rdfa    = RDF::RDFa::Parser->new_from_url($url, $options);
        my $model   = $rdfa->graph;

       For dealing with local data:

        use RDF::RDFa::Parser;
        # $markup holds an XHTML/XML string or an XML::LibXML::Document
        my $base_url = 'http://example.com/document.html';
        my $options  = RDF::RDFa::Parser::Config->new('xhtml', '1.1');
        my $rdfa     = RDF::RDFa::Parser->new($markup, $base_url, $options);
        my $model    = $rdfa->graph;

       A simple set of operations for working with Open Graph Protocol data:

        use RDF::RDFa::Parser;
        my $url     = 'http://www.rottentomatoes.com/m/net/';
        my $options = RDF::RDFa::Parser::Config->tagsoup;
        my $rdfa    = RDF::RDFa::Parser->new_from_url($url, $options);
        print $rdfa->opengraph('title') . "\n";
        print $rdfa->opengraph('image') . "\n";

DESCRIPTION

       RDF::TrineX::Parser::RDFa provides a saner interface for this module.  If you are new to
       parsing RDFa with Perl, then that's the best place to start.

   Forthcoming API Changes
       Some of the logic regarding host language and RDFa version guessing is likely to be
       removed from RDF::RDFa::Parser and RDF::RDFa::Parser::Config, and shifted into
       RDF::TrineX::Parser::RDFa instead.

   Constructors
       "$p = RDF::RDFa::Parser->new($markup, $base, [$config], [$storage])"
           This method creates a new RDF::RDFa::Parser object and returns it.

           The $markup variable may contain an XHTML/XML string, or an XML::LibXML::Document. If
           it is a string, the document is parsed using XML::LibXML::Parser or
           HTML::HTML5::Parser, depending on the configuration in $config. XML well-formedness
           errors will cause the function to die.

           $base is a URL used to resolve relative links found in the document.

           $config optionally holds an RDF::RDFa::Parser::Config object which determines the set
           of rules used to parse the RDFa. It defaults to XHTML+RDFa 1.1.

           Advanced usage note: $storage optionally holds an RDF::Trine::Store object. If undef,
           then a new temporary store is created.
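
           As a minimal sketch of this constructor, the following parses an inline XHTML string
           and counts the triples found. The markup, the base URL and the Dublin Core property
           are illustrative assumptions, not something this module requires.

```perl
use strict;
use warnings;
use RDF::RDFa::Parser;

# Illustrative XHTML+RDFa document (an assumption made for this example)
my $markup = <<'XHTML';
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/">
  <head><title>Example</title></head>
  <body>
    <p about="" property="dc:title">An example document</p>
  </body>
</html>
XHTML

my $base   = 'http://example.com/document.html';
my $config = RDF::RDFa::Parser::Config->new('xhtml', '1.1');

# new() dies if the markup is not well-formed XML
my $parser = RDF::RDFa::Parser->new($markup, $base, $config);

# graph() returns an RDF::Trine::Model
my $model = $parser->graph;
printf "%d triple(s) found\n", $model->size;
```

           The fourth argument ($storage) is omitted here, so a temporary store is used.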

       "$p = RDF::RDFa::Parser->new_from_url($url, [$config], [$storage])"
       "$p = RDF::RDFa::Parser->new_from_uri($url, [$config], [$storage])"
           $url is a URL to fetch and parse, or an HTTP::Response object.

           $config optionally holds an RDF::RDFa::Parser::Config object which determines the set
           of rules used to parse the RDFa. The default is to determine the configuration by
           looking at the HTTP response Content-Type header; it's probably sensible to keep the
           default.

           $storage optionally holds an RDF::Trine::Store object. If undef, then a new temporary
           store is created.

           This function can also be called as "new_from_url" or "new_from_uri".  Same thing.

       "$p = RDF::RDFa::Parser->new_from_response($response, [$config], [$storage])"
           $response is an "HTTP::Response" object.

           Otherwise the same as "new_from_url".

   Public Methods
       "$p->graph"
           This will return an RDF::Trine::Model containing all the RDFa data found on the page.

           Advanced usage note: If passed a graph URI as a parameter, will return a single named
           graph from within the page. This feature is only useful if you're using named graphs.
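
           The returned model can be walked with the usual RDF::Trine iterator methods. The
           sketch below assumes a small illustrative document and prints each statement found;
           the markup and base URL are made up for the example.

```perl
use strict;
use warnings;
use RDF::RDFa::Parser;

# Illustrative markup (an assumption made for this example)
my $markup = <<'XHTML';
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/">
  <head><title>Example</title></head>
  <body>
    <p about="" property="dc:title">An example document</p>
  </body>
</html>
XHTML

my $parser = RDF::RDFa::Parser->new(
    $markup,
    'http://example.com/document.html',
    RDF::RDFa::Parser::Config->new('xhtml', '1.1'),
);

# graph() gives an RDF::Trine::Model; as_stream() gives an iterator over it
my $model = $parser->graph;
my $iter  = $model->as_stream;

my @lines;
while (my $statement = $iter->next) {
    push @lines, $statement->as_string;
}
print "$_\n" for @lines;
```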

       "$p->graphs"
           Advanced usage only.

           Will return a hashref of all named graphs, where the graph name is a key and the value
           is an RDF::Trine::Model tied to a temporary storage.

           This method is only useful if you're using named graphs.

       "$p->opengraph([$property])"
           If $property is provided, will return the value or list of values (if called in list
           context) for that Open Graph Protocol property. (In pure RDF terms, it returns the
           non-bnode objects of triples where the subject is the document base URI; and the
           predicate is $property, with non-URI $property strings taken as having the implicit
           prefix 'http://ogp.me/ns#'. There is no distinction between literal and non-literal
           values; literal datatypes and languages are dropped.)

           If $property is omitted, returns a list of possible properties.

           Example:

             foreach my $property (sort $p->opengraph)
             {
               print "$property :\n";
               foreach my $val (sort $p->opengraph($property))
               {
                 print "  * $val\n";
               }
             }

           See also: <http://opengraphprotocol.org/>.

       "$p->dom"
           Returns the parsed XML::LibXML::Document.

       "$p->uri( [$other_uri] )"
           Returns the base URI of the document being parsed. This will usually be the same as
           the base URI provided to the constructor, but may differ if the document contains a
           <base> HTML element.

           Optionally it may be passed a parameter - an absolute or relative URI - in which case
           it returns the same URI which it was passed as a parameter, but as an absolute URI,
           resolved relative to the document's base URI.

           This seems like two unrelated functions, but if you consider the consequence of
           passing a relative URI consisting of a zero-length string, it in fact makes sense.
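
           A short sketch of both uses follows. The document and the relative path are
           illustrative assumptions; the document deliberately has no <base> element, so the
           base URI is the one given to the constructor.

```perl
use strict;
use warnings;
use RDF::RDFa::Parser;

# Minimal illustrative document with no <base> element
my $markup = <<'XHTML';
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body><p>Nothing to see here.</p></body>
</html>
XHTML

my $parser = RDF::RDFa::Parser->new(
    $markup,
    'http://example.com/docs/page.html',
    RDF::RDFa::Parser::Config->new('xhtml', '1.1'),
);

# No argument: the document's base URI
my $base = $parser->uri;

# With an argument: the same URI made absolute against the base
my $resolved = $parser->uri('../images/logo.png');

print "base:     $base\n";
print "resolved: $resolved\n";
```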

       "$p->errors"
           Returns a list of errors and warnings that occurred during parsing.

       "$p->processor_graph"
           As per "$p->errors" but returns data as an RDF model.

       "$p->output_graph"
           An alias for "graph", but does not accept a parameter.

       "$p->processor_and_output_graph"
           Union of the above two graphs.

       "$p->consume"
           Advanced usage only.

           The document is parsed for RDFa. As of RDF::RDFa::Parser 1.09x, this is called
           automatically when needed; you probably don't need to touch it unless you're doing
           interesting things with callbacks.

           Calling "$p->consume(survive => 1)" will avoid crashing (e.g.  when the markup
           provided cannot be parsed), and instead make more errors available in "$p->errors".
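
           A hedged sketch of tolerant parsing: consume is called explicitly with
           "survive => 1" and whatever "errors" returns is dumped for inspection. The structure
           of the individual error entries is not described here, so the example makes no
           assumption about it and uses Data::Dumper. The markup is illustrative.

```perl
use strict;
use warnings;
use Data::Dumper;
use RDF::RDFa::Parser;

# Illustrative markup (an assumption made for this example)
my $markup = <<'XHTML';
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/">
  <head><title>Example</title></head>
  <body>
    <p about="" property="dc:title">An example document</p>
  </body>
</html>
XHTML

my $parser = RDF::RDFa::Parser->new(
    $markup,
    'http://example.com/document.html',
    RDF::RDFa::Parser::Config->new('xhtml', '1.1'),
);

# Parse now, asking the parser to record problems rather than crash
$parser->consume(survive => 1);

my @errors = $parser->errors;
if (@errors) {
    print Dumper(\@errors);
}
else {
    print "no errors or warnings recorded\n";
}
printf "%d triple(s) in the output graph\n", $parser->graph->size;
```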

       "$p->set_callbacks(\%callbacks)"
           Advanced usage only.

           Set callback functions for the parser to call on certain events. These are only
           necessary if you want to do something especially unusual.

             $p->set_callbacks({
               'pretriple_resource' => sub { ... } ,
               'pretriple_literal'  => sub { ... } ,
               'ontriple'           => undef ,
               'onprefix'           => \&some_function ,
               });

           Either of the two pretriple callbacks can be set to the string 'print' instead of a
           coderef.  This enables built-in callbacks for printing Turtle to STDOUT.

           For details of the callback functions, see the section CALLBACKS. If used,
           "set_callbacks" must be called before "consume". "set_callbacks" returns a reference
           to the parser object itself.

       "$p->element_subjects"
           Advanced usage only.

           Gets/sets a hashref of { xpath => RDF::Trine::Node } mappings.

           This is not touched during normal RDFa parsing, only being used by the @role and @cite
           features where RDF resources (i.e. URIs and blank nodes) are needed to represent XML
           elements themselves.

CALLBACKS

       Several callback functions are provided. These may be set using the "set_callbacks"
       function, which takes a hashref of keys pointing to coderefs. The keys are named for the
       event to fire the callback on.

   ontriple
       This is called once a triple is ready to be added to the graph. (After the pretriple
       callbacks.) The parameters passed to the callback function are:

       •   A reference to the "RDF::RDFa::Parser" object

       •   A hashref of relevant "XML::LibXML::Element" objects (subject, predicate, object,
           graph, current)

       •   An RDF::Trine::Statement object.

       The callback should return 1 to tell the parser to skip this triple (not add it to the
       graph); return 0 otherwise. The callback may modify the RDF::Trine::Statement object.
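
       As an illustration, an ontriple callback can be used to filter triples. The sketch below
       drops every triple with a chosen predicate and keeps the rest; the markup, the predicate
       and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;
use RDF::RDFa::Parser;

# Illustrative markup with two properties (an assumption for this example)
my $markup = <<'XHTML';
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/">
  <head><title>Example</title></head>
  <body>
    <p about="" property="dc:title">An example document</p>
    <p about="" property="dc:creator">A. N. Author</p>
  </body>
</html>
XHTML

my $unwanted = 'http://purl.org/dc/terms/creator';
my $skipped  = 0;

my $parser = RDF::RDFa::Parser->new(
    $markup,
    'http://example.com/document.html',
    RDF::RDFa::Parser::Config->new('xhtml', '1.1'),
);

# Callbacks must be set before consume is called
$parser->set_callbacks({
    ontriple => sub {
        my ($p, $elements, $statement) = @_;
        if ($statement->predicate->uri_value eq $unwanted) {
            $skipped++;
            return 1;    # skip: do not add this triple to the graph
        }
        return 0;        # keep the triple
    },
});
$parser->consume;

my $model = $parser->graph;
printf "kept %d triple(s), skipped %d\n", $model->size, $skipped;
```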

   onprefix
       This is called when a new CURIE prefix is discovered. The parameters passed to the
       callback function are:

       •   A reference to the "RDF::RDFa::Parser" object

       •   A reference to the "XML::LibXML::Element" being parsed

       •   The prefix (string, e.g. "foaf")

       •   The expanded URI (string, e.g. "http://xmlns.com/foaf/0.1/")

       The return value of this callback is currently ignored, but you should return 0 in case
       future versions of this module assign significance to the return value.

   ontoken
       This is called when a CURIE or term has been expanded. The parameters are:

       •   A reference to the "RDF::RDFa::Parser" object

       •   A reference to the "XML::LibXML::Element" being parsed

       •   The CURIE or token as a string (e.g. "foaf:name" or "Stylesheet")

       •   The fully expanded URI

       The callback function must return a fully expanded URI, or if it wants the CURIE to be
       ignored, undef.

   onerror
       This is called when an error occurs:

       •   A reference to the "RDF::RDFa::Parser" object

       •   The error level (RDF::RDFa::Parser::ERR_ERROR or RDF::RDFa::Parser::ERR_WARNING)

       •   An error code

       •   An error message

       •   A hash of other information

       The return value of this callback is currently ignored, but you should return 0 in case
       future versions of this module assign significance to the return value.
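
       A sketch of an onerror callback that collects messages into an array instead of letting
       them go to STDERR. The trailing "hash of other information" is captured as a plain list
       because its exact shape is not specified here; the markup and variable names are
       assumptions, and with well-formed input the callback may never fire.

```perl
use strict;
use warnings;
use RDF::RDFa::Parser;

# Illustrative markup (an assumption made for this example)
my $markup = <<'XHTML';
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/">
  <head><title>Example</title></head>
  <body>
    <p about="" property="dc:title">An example document</p>
  </body>
</html>
XHTML

my @problems;

my $parser = RDF::RDFa::Parser->new(
    $markup,
    'http://example.com/document.html',
    RDF::RDFa::Parser::Config->new('xhtml', '1.1'),
);

$parser->set_callbacks({
    onerror => sub {
        my ($p, $level, $code, $message, @info) = @_;
        my $kind = ($level eq RDF::RDFa::Parser::ERR_ERROR())
            ? 'error'
            : 'warning';
        push @problems, "$kind [$code]: $message";
        return 0;    # return value is currently ignored
    },
});
$parser->consume(survive => 1);

print "$_\n" for @problems;
printf "%d problem(s) reported\n", scalar @problems;
```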

       If you do not define an onerror callback, then errors will be output via STDERR and
       warnings will be silent. Either way, you can retrieve errors after parsing using the
       "errors" method.

   pretriple_resource
       This callback is deprecated - use ontriple instead.

       This is called when a triple has been found, but before preparing the triple for adding to
       the model. It is only called for triples with a non-literal object value.

       The parameters passed to the callback function are:

       •   A reference to the "RDF::RDFa::Parser" object

       •   A reference to the "XML::LibXML::Element" being parsed

       •   Subject URI or bnode (string)

       •   Predicate URI (string)

       •   Object URI or bnode (string)

       •   Graph URI or bnode (string or undef)

       The callback should return 1 to tell the parser to skip this triple (not add it to the
       graph); return 0 otherwise.

   pretriple_literal
       This callback is deprecated - use ontriple instead.

       This is the equivalent of pretriple_resource, but is only called for triples with a
       literal object value.

       The parameters passed to the callback function are:

       •   A reference to the "RDF::RDFa::Parser" object

       •   A reference to the "XML::LibXML::Element" being parsed

       •   Subject URI or bnode (string)

       •   Predicate URI (string)

       •   Object literal (string)

       •   Datatype URI (string or undef)

       •   Language (string or undef)

       •   Graph URI or bnode (string or undef)

       Beware: sometimes both a datatype and a language will be passed.  This goes beyond the
       normal RDF data model.

       The callback should return 1 to tell the parser to skip this triple (not add it to the
       graph); return 0 otherwise.

FEATURES

       Most features are configurable using RDF::RDFa::Parser::Config.

   RDFa Versions
       RDF::RDFa::Parser supports RDFa versions 1.0 and 1.1.

       1.1 is currently a moving target; support is experimental.

       1.1 is the default, but this can be configured using RDF::RDFa::Parser::Config.

   Host Languages
       RDF::RDFa::Parser supports various different RDFa host languages:

       •   XHTML

           As per the XHTML+RDFa 1.0 and XHTML+RDFa 1.1 specifications.

       •   HTML 4

           Uses an HTML5 (sic) parser; uses @lang instead of @xml:lang; keeps prefixes and terms
           case-insensitive; recognises the @rel relations defined in the HTML 4 specification.
           Otherwise the same as XHTML.

       •   HTML5

           Uses an HTML5 parser; uses @lang as well as @xml:lang; keeps prefixes and terms
           case-insensitive; recognises the @rel relations defined in the HTML5 draft
           specification. Otherwise the same as XHTML.

       •   XML

           This is implemented as per the RDFa Core 1.1 specification. There is also support for
           "RDFa Core 1.0", for which no specification exists, but has been reverse-engineered by
           applying the differences between XHTML+RDFa 1.1 and RDFa Core 1.1 to the XHTML+RDFa
           1.0 specification.

           Embedded chunks of RDF/XML within XML are supported.

       •   SVG

           For now, a synonym for XML.

       •   Atom

           The <feed> and <entry> elements are treated specially, setting a new subject;
           IANA-registered rel keywords are recognised.

           By passing "atom_parser=>1" as a Config option, you can also handle Atom's native
           semantics. (Uses XML::Atom::OWL. If this module is not installed, this option is
           silently ignored.)

           Otherwise, the same as XML.

       •   DataRSS

           Defines some default prefixes. Otherwise, the same as Atom.

       •   OpenDocument XML

           That is, XML content formatted along the lines of 'content.xml' in OpenDocument files.

           Supports OpenDocument bookmarked ranges used as typed or plain object literals (though
           not XML literals); expects RDFa attributes in the XHTML namespace instead of in no
           namespace. Otherwise, the same as XML.

       •   OpenDocument

           That is, a ZIP file containing OpenDocument XML files. RDF::RDFa::Parser will do all
           the unzipping and combining for you, so you don't have to. The unregistered "jar:"
           URI scheme is used to refer to files within the ZIP.
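
       The host language is chosen through the first argument to the Config constructor, as in
       the SYNOPSIS. The sketch below parses a tag-soup HTML5 fragment; the host name string
       'html5' is an assumption based on the naming used in this page (the authoritative list
       is in RDF::RDFa::Parser::Config), and the markup is illustrative.

```perl
use strict;
use warnings;
use RDF::RDFa::Parser;

# Illustrative HTML5 fragment with unclosed elements (an assumption)
my $html = <<'HTML';
<!DOCTYPE html>
<html prefix="dc: http://purl.org/dc/terms/">
<head><title>Example</title>
<body>
<p property="dc:title">An HTML5 example
HTML

# 'html5' is assumed to be the host language name for HTML5 documents
my $config = RDF::RDFa::Parser::Config->new('html5', '1.1');

my $parser = RDF::RDFa::Parser->new(
    $html,
    'http://example.com/page.html',
    $config,
);

my $model = $parser->graph;
printf "%d triple(s) found\n", $model->size;
```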

   Embedded RDF/XML
       Though a rarely used feature, XHTML allows other XML markup languages to be directly
       embedded into it. In particular, chunks of RDF/XML can be included in XHTML. While this is
       not common in XHTML, it's seen quite often in SVG and other XML markup languages.

       When RDF::RDFa::Parser encounters a chunk of RDF/XML in a document it's parsing (i.e. an
       element called 'RDF' with namespace 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'), there
       are three different courses of action it can take:

       0. Continue straight through it.
           This is the behaviour that XHTML+RDFa seems to suggest is the right option. It should
           mostly not do any harm: triples encoded in RDF/XML will be generally ignored (though
           the chunk itself could theoretically end up as part of an XML literal). It will waste
           a bit of time though.

       1. Parse the RDF/XML.
           The parser will parse the RDF/XML properly. If named graphs are enabled, any triples
           will be added to a separate graph. This is the behaviour that SVG Tiny 1.2 seems to
           suggest is the correct thing to do.

       2. Skip the chunk.
           This will skip over the RDF element entirely, and thus save you a bit of time.

       You can decide which path to take by setting the 'embedded_rdfxml' Config option. For HTML
       and XHTML, you probably want to set embedded_rdfxml to '0' (the default) or '2' (a little
       faster). For other XML markup languages (e.g. SVG or Atom), then you probably want to set
       it to '1'.

       (There's also an option '3' which controls how embedded RDF/XML interacts with named
       graphs, but this is only really intended for internal use, parsing OpenDocument.)
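
       A sketch of option '1' applied to an SVG document. It assumes that Config options are
       passed as key/value pairs after the host language and version, and that 'svg' is the
       host language name; both should be checked against RDF::RDFa::Parser::Config. The SVG
       content is illustrative.

```perl
use strict;
use warnings;
use RDF::RDFa::Parser;

# Illustrative SVG with an embedded RDF/XML chunk (an assumption)
my $svg = <<'SVG';
<svg xmlns="http://www.w3.org/2000/svg">
  <metadata>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/terms/">
      <rdf:Description rdf:about="">
        <dc:title>An SVG image</dc:title>
      </rdf:Description>
    </rdf:RDF>
  </metadata>
</svg>
SVG

# embedded_rdfxml => 1 asks the parser to parse the RDF/XML properly
my $config = RDF::RDFa::Parser::Config->new(
    'svg', '1.1',
    embedded_rdfxml => 1,
);

my $parser = RDF::RDFa::Parser->new(
    $svg,
    'http://example.com/image.svg',
    $config,
);

my $model = $parser->graph;
printf "%d triple(s) found\n", $model->size;
```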

   Named Graphs
       The parser has support for named graphs within a single RDFa document. To switch this on,
       use the 'graph' Config option.

       See also <http://buzzword.org.uk/2009/rdfa4/spec>.

       The name of the attribute which indicates graph URIs is by default 'graph', but can be
       changed using the 'graph_attr' Config option. This option accepts Clark Notation to
       specify a namespaced attribute. By default, the attribute value is interpreted as like the
       'about' attribute (i.e. CURIEs, URIs, etc), but if you set the 'graph_type' Config option
       to 'id', it will be treated as setting a fragment identifier (like the 'id' attribute).

       The 'graph_default' Config option allows you to set the default graph URI/bnode
       identifier.

       Once you're using named graphs, the "graphs" method becomes useful: it returns a hashref
       of { graph_uri => trine_model } pairs. The optional parameter to the "graph" method also
       becomes useful.

       OpenDocument (ZIP) host language support makes internal use of named graphs, so if you're
       parsing OpenDocument, tinker with the graph Config options at your own risk!
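
       A sketch of named graph usage. It assumes that the 'graph' option is passed as a
       key/value pair to the Config constructor and that the default 'graph' attribute name is
       in effect; the markup and graph name are illustrative.

```perl
use strict;
use warnings;
use RDF::RDFa::Parser;

# Illustrative markup using the default 'graph' attribute (an assumption)
my $markup = <<'XHTML';
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/">
  <head><title>Example</title></head>
  <body>
    <div graph="#g1">
      <p about="" property="dc:title">An example document</p>
    </div>
  </body>
</html>
XHTML

# graph => 1 switches on named graph support
my $config = RDF::RDFa::Parser::Config->new('xhtml', '1.1', graph => 1);

my $parser = RDF::RDFa::Parser->new(
    $markup,
    'http://example.com/document.html',
    $config,
);

# graphs() returns a hashref of { graph_uri => RDF::Trine::Model }
my $graphs = $parser->graphs;
foreach my $name (sort keys %$graphs) {
    printf "%s: %d triple(s)\n", $name, $graphs->{$name}->size;
}
```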

   Auto Config
       RDF::RDFa::Parser has a lot of different Config options to play with. Sometimes it might
       be useful to allow the page being parsed to control some of these options. If you switch
       on the 'auto_config' Config option, pages can do this.

       A page can set options using a specially crafted <meta> tag:

        <meta name="http://search.cpan.org/dist/RDF-RDFa-Parser/#auto_config"
           content="xhtml_lang=0&amp;xml_base=2" />

       Note that the "content" attribute is an application/x-www-form-urlencoded string (which
       must then be HTML-escaped of course). Semicolons may be used instead of ampersands, as
       these tend to look nicer:

        <meta name="http://search.cpan.org/dist/RDF-RDFa-Parser/#auto_config"
           content="xhtml_lang=0;xml_base=2" />

       It's possible to use auto config outside XHTML (e.g. in Atom or SVG) using namespaces:

        <xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml"
           name="http://search.cpan.org/dist/RDF-RDFa-Parser/#auto_config"
           content="xhtml_lang=0;xml_base=2;atom_elements=1" />

       Any Config option may be given using auto config, except 'use_rtnlx', 'dom_parser', and of
       course 'auto_config' itself.

   Profiles
       Support for Profiles (an experimental RDFa 1.1 feature) was added in version 1.09_00, but
       dropped after version 1.096, because the feature was removed from draft specs.

BUGS

       RDF::RDFa::Parser 0.21 passed all approved tests in the XHTML+RDFa test suite at the time
       of its release.

       RDF::RDFa::Parser 0.22 (used in conjunction with HTML::HTML5::Parser 0.01 and
       HTML::HTML5::Sanity 0.01) additionally passes all approved tests in the HTML4+RDFa and
       HTML5+RDFa test suites at the time of its release; except test cases 0113 and 0121, which
       the author of this module believes mandate incorrect HTML parsing.

       RDF::RDFa::Parser 1.096_01 passes all approved tests on the default graph (not the
       processor graph) in the RDFa 1.1 test suite for language versions 1.0 and host languages
       xhtml1, html4 and html5, with the following exceptions which are skipped:

       •   0140 - wilful violation, pending proof that the test is backed up by the spec.

       •   0198 - an XML canonicalisation test that may be dropped in the future.

       •   0212 - wilful violation, as passing this test would require regressing on the old RDFa
           1.0 test suite.

       •   0251 to 0256 pass with RDFa 1.1 and are skipped in RDFa 1.0 because they use
           RDFa-1.1-specific syntax.

       •   0256 is additionally skipped in HTML4 mode, as the author believes xml:lang should be
           ignored in HTML versions prior to HTML5.

       •   0303 - wilful violation, as this feature is simply awful.

       Please report any bugs to <http://rt.cpan.org/>.

       Common gotchas:

       •   Are you using the XML catalogue?

           RDF::RDFa::Parser maintains a locally cached version of the XHTML+RDFa DTD. This
           will normally be within your Perl module directory, in a subdirectory named
           "auto/share/dist/RDF-RDFa-Parser/catalogue/". If this is missing, the parser
           should still work, but will be very slow.

AUTHOR

       Toby Inkster <tobyink@cpan.org>.

ACKNOWLEDGEMENTS

       Kjetil Kjernsmo <kjetilk@cpan.org> wrote much of the stuff for building RDF::Trine models.
       Neubert Joachim taught me to use XML catalogues, which massively speeds up parsing of
       XHTML files that have DTDs.

COPYRIGHT AND LICENCE

       Copyright 2008-2012 Toby Inkster

       This is free software; you can redistribute it and/or modify it under the same terms as
       the Perl 5 programming language system itself.

DISCLAIMER OF WARRANTIES

       THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING,
       WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
       PURPOSE.