Provided by: libhtml-html5-microdata-parser-perl_0.100-1_all

NAME

       HTML::HTML5::Microdata::Parser - fairly experimental parser for HTML 'microdata'

SYNOPSIS

         use HTML::HTML5::Microdata::Parser;

         my $parser = HTML::HTML5::Microdata::Parser->new($html, $baseURI);
         my $graph  = $parser->graph;

DESCRIPTION

       This package aims to provide an API roughly compatible with RDF::RDFa::Parser.

       Microdata is an experimental metadata format, not in wide use. Use this module at your own
       risk.

       $p = HTML::HTML5::Microdata::Parser->new($html, $baseuri, \%options, $storage)
               This method creates a new HTML::HTML5::Microdata::Parser object and returns it.

               The $html variable may contain an HTML or XHTML string, or an
               XML::LibXML::Document. If a string, the document is parsed using
               HTML::HTML5::Parser and HTML::HTML5::Sanity, which may throw an exception.
               HTML::HTML5::Microdata::Parser does not catch the exception.

               The base URI is used to resolve relative URIs found in the document.

               Options [default in brackets]:

                 * alt_stylesheet  - Magic rel="alternate stylesheet". [1]
                 * auto_config     - See section "Auto Config" [0]
                 * mhe_lang        - Process <meta http-equiv=Content-Language>.
                                     [1]
                 * prefix_empty    - URI prefix for itemprops of untyped items.
                                     [undef]
                 * strategy        - URI generation strategy for itemprops of
                                     typed items. [HTML::HTML5::Microdata::
                                     Strategy::Heuristic]
                 * tdb_service     - Use thing-described-by.org when possible. [0]
                 * xhtml_base      - Process <base href> element. [1]
                 * xhtml_lang      - Process @lang. [1]
                 * xhtml_meta      - Process <meta>. [0]
                 * xhtml_cite      - Process @cite. [0]
                 * xhtml_rel       - Process @rel. [0]
                 * xhtml_time      - Process <time> element more nicely. [0]
                 * xhtml_title     - Process <title> element. [0]
                 * xml_lang        - Process @xml:lang. [1]

               $storage is an RDF::Trine::Storage object. If undef, then a new temporary store is
               created.
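
               A rough sketch of a constructor call with an options hashref follows. The
               markup, the base URI and the helper name build_parser are illustrative
               assumptions, not part of the module. The module is loaded lazily inside the
               helper so that the option hash can be inspected on its own.

```perl
use strict;
use warnings;

# Illustrative input; the markup and base URI are assumptions for this sketch.
my $html     = '<p itemscope><span itemprop="name">Alice</span></p>';
my $base_uri = 'http://example.com/people/alice';

# Switch on two options that default to 0; all others keep their defaults.
my %options = (
    xhtml_time  => 1,
    xhtml_title => 1,
);

# Hypothetical helper: builds a parser backed by a new temporary store
# (the fourth constructor argument, $storage, is left undefined).
sub build_parser {
    require HTML::HTML5::Microdata::Parser;
    return HTML::HTML5::Microdata::Parser->new($html, $base_uri, \%options, undef);
}
```

               Because the constructor does not catch parser exceptions, a caller may want to
               wrap build_parser() in an eval block.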

       $p->xhtml
               Returns the HTML source of the document being parsed.

       $p->uri Returns the base URI of the document being parsed. This will usually be the same
               as the base URI provided to the constructor, but may differ if the document
               contains a <base> HTML element.

               Optionally, it may be passed an absolute or relative URI as a parameter, in
               which case it returns that URI in absolute form, resolved against the document's
               base URI.

               These may seem like two unrelated functions, but they are the same operation:
               resolving a zero-length relative URI yields the base URI itself.
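
               The two ways of calling "uri" might be exercised as in the sketch below. The
               helper name describe_uris and the relative reference passed to it are
               hypothetical; $p is assumed to be a parser object.

```perl
use strict;
use warnings;

# Hypothetical helper: $p is assumed to be an HTML::HTML5::Microdata::Parser
# object and $relative an absolute or relative URI string.
sub describe_uris {
    my ($p, $relative) = @_;
    my $base     = $p->uri;              # document base URI
    my $resolved = $p->uri($relative);   # $relative made absolute against the base
    my $empty    = $p->uri('');          # zero-length relative URI
    return { base => $base, resolved => $resolved, empty => $empty };
}
```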

       $p->dom Returns the parsed XML::LibXML::Document.

       $p->set_callbacks(\%callbacks)
               Set callback functions for the parser to call on certain events. These are only
               necessary if you want to do something especially unusual.

                 $p->set_callbacks({
                   'pretriple_resource' => sub { ... } ,
                   'pretriple_literal'  => sub { ... } ,
                   'ontriple'           => undef ,
                   });

               Either of the two pretriple callbacks can be set to the string 'print' instead of
               a coderef.  This enables built-in callbacks for printing Turtle to STDOUT.

               For details of the callback functions, see the section CALLBACKS. "set_callbacks"
               must be called before "consume". It returns a reference to the parser object.

       $p->consume
               The document is parsed for Microdata. Nothing of interest is returned by this
               function, but the triples extracted from the document are passed to the callbacks
               as each one is found.

               The "graph" method automatically calls "consume", so normally you don't need to
               call it manually. If you're using callback functions, it may be useful though.
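
               When callbacks are in use, "consume" can be chained directly onto
               "set_callbacks", since the latter returns the parser. A minimal sketch, using
               the built-in 'print' callbacks; the helper name stream_triples is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: $p is assumed to be a freshly constructed parser on
# which neither "consume" nor "graph" has been called yet.
sub stream_triples {
    my ($p) = @_;
    # "set_callbacks" returns the parser, so "consume" can be chained onto it.
    $p->set_callbacks({
        pretriple_resource => 'print',   # built-in Turtle printer
        pretriple_literal  => 'print',   # built-in Turtle printer
    })->consume;
    return $p;
}
```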

       $p->consume_microdata_item($element)
               You almost certainly do not want to use this method.

               It will consume a single Microdata item, assuming that $element is an element that
               does or should have the @itemscope attribute set. Returns the URI or blank node
               identifier for the item.

               This method is exposed mostly for the benefit of HTML::HTML5::Microdata::ToRDFa.

       $p->graph()
               This method will return an RDF::Trine::Model object with all statements of the
               full graph.
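
               One way to inspect the returned model is to walk its statements with the
               RDF::Trine iterator interface. This is a sketch only; the helper name
               statement_strings is hypothetical and $p is assumed to be a parser object.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one string per statement in the parsed graph.
sub statement_strings {
    my ($p) = @_;
    my $model = $p->graph;          # RDF::Trine::Model
    my $iter  = $model->as_stream;  # iterator over every statement in the model
    my @lines;
    while (my $st = $iter->next) {
        push @lines, $st->as_string;
    }
    return @lines;
}
```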

       $p->graphs()
               Provided for RDF::RDFa::Parser compatibility.

CALLBACKS

       Several callback functions are provided. These may be set using the "set_callbacks"
       function, which takes a hashref of keys pointing to coderefs. The keys are named for the
       event to fire the callback on.

   pretriple_resource
       This is called when a triple has been found, but before preparing the triple for adding to
       the model. It is only called for triples with a non-literal object value.

       The parameters passed to the callback function are:

       •   A reference to the "HTML::HTML5::Microdata::Parser" object

       •   A reference to the "XML::LibXML::Element" being parsed

       •   Subject URI or bnode (string)

       •   Predicate URI (string)

       •   Object URI or bnode (string)

       The callback should return 1 to tell the parser to skip this triple (not add it to the
       graph); return 0 otherwise.
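
       As an illustration, a pretriple_resource callback that drops every triple using one
       particular predicate might look like the sketch below. The predicate URI and the
       variable names are assumptions made for the example.

```perl
use strict;
use warnings;

# Illustrative predicate to suppress; the URI is an assumption for this sketch.
my $unwanted = 'http://example.com/vocab#internalLink';

# Arguments arrive in the order listed above.
my $skip_unwanted = sub {
    my ($parser, $element, $subject, $predicate, $object) = @_;
    return $predicate eq $unwanted ? 1 : 0;   # 1 = skip the triple, 0 = keep it
};

my %callbacks = (pretriple_resource => $skip_unwanted);
```

       The hash would then be registered with $p->set_callbacks(\%callbacks) before "consume"
       is called.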

   pretriple_literal
       This is the equivalent of pretriple_resource, but is only called for triples with a
       literal object value.

       The parameters passed to the callback function are:

       •   A reference to the "HTML::HTML5::Microdata::Parser" object

       •   A reference to the "XML::LibXML::Element" being parsed

       •   Subject URI or bnode (string)

       •   Predicate URI (string)

       •   Object literal (string)

       •   Datatype URI (string or undef)

       •   Language (string or undef)

       (Beware: sometimes both a datatype and a language will be passed. This goes beyond the
       normal RDF data model.)

       The callback should return 1 to tell the parser to skip this triple (not add it to the
       graph); return 0 otherwise.
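
       A pretriple_literal callback can use the extra datatype and language parameters. The
       sketch below keeps untagged and English literals and skips the rest; the filtering
       policy and the variable name are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Keep literals that are untagged or tagged as English; skip all others.
my $keep_english = sub {
    my ($parser, $element, $subject, $predicate, $literal, $datatype, $lang) = @_;
    return 0 unless defined $lang;            # language may be undef: keep
    return $lang =~ /^en(?:-|$)/i ? 0 : 1;    # "en", "en-GB", ... kept; others skipped
};
```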

   ontriple
       This is called once a triple is ready to be added to the graph. (After the pretriple
       callbacks.) The parameters passed to the callback function are:

       •   A reference to the "HTML::HTML5::Microdata::Parser" object

       •   A reference to the "XML::LibXML::Element" being parsed

       •   An RDF::Trine::Statement object.

       The callback should return 1 to tell the parser to skip this triple (not add it to the
       graph); return 0 otherwise. The callback may modify the RDF::Trine::Statement object.
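
       A minimal ontriple sketch that records each statement as it is emitted, while leaving
       the graph untouched, could look like this. The variable names are illustrative.

```perl
use strict;
use warnings;

my @seen;   # statements collected in the order they are emitted

my $collect = sub {
    my ($parser, $element, $statement) = @_;
    push @seen, $statement;   # an RDF::Trine::Statement object
    return 0;                 # 0 = still add the triple to the graph
};
```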

ITEMPROP URI GENERATION STRATEGY

       The "itemprop" attribute does not need to be a full URI.

        <div itemscope itemtype="http://example.com/Person">
          <span itemprop="phoneNumber">01234 567 890</span>
        </div>

       The "strategy" option passed to the constructor tells the parser how to convert
       "phoneNumber" in the above example into a URI. This can be a callback function, or an
       object or class that provides a "generate_uri" method.

       Three strategies are bundled with this distribution:

       •   HTML::HTML5::Microdata::Strategy::Basic - don't attempt to convert to a URI

       •   HTML::HTML5::Microdata::Strategy::Heuristic - smart strategy, the default

       •   HTML::HTML5::Microdata::Strategy::Microdata0 - official strategy of early Microdata
           drafts
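
       Selecting one of the bundled strategies might look like the following sketch. Passing
       the class name as a string is an assumption based on the statement above that a class
       providing "generate_uri" is accepted; the helper name build_basic_parser is
       hypothetical.

```perl
use strict;
use warnings;

# Ask for the Basic strategy instead of the default Heuristic one.
my %options = (
    strategy => 'HTML::HTML5::Microdata::Strategy::Basic',
);

# Hypothetical helper: loads the parser and the strategy class on demand.
sub build_basic_parser {
    my ($html, $base_uri) = @_;
    require HTML::HTML5::Microdata::Parser;
    require HTML::HTML5::Microdata::Strategy::Basic;
    return HTML::HTML5::Microdata::Parser->new($html, $base_uri, \%options);
}
```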

AUTO CONFIG

       HTML::HTML5::Microdata::Parser has a lot of different options that can be switched on and
       off. Sometimes it might be useful to allow the page being parsed to control some of the
       options. If you switch on the 'auto_config' option, pages can do this.

       A page can set options using a specially crafted <meta> tag:

         <meta name="http://search.cpan.org/dist/HTML-HTML5-Microdata-Parser/#auto_config"
            content="alt_stylesheet=0&amp;prefix_empty=http://example.net/" />

       Note that the "content" attribute is an application/x-www-form-urlencoded string (which
       must then be HTML-escaped of course). Semicolons may be used instead of ampersands, as
       these tend to look nicer:

         <meta name="http://search.cpan.org/dist/HTML-HTML5-Microdata-Parser/#auto_config"
            content="alt_stylesheet=0;prefix_empty=http://example.net/" />

       Any option allowed in the constructor may be given using auto config, except 'auto_config'
       itself.
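
       To make the "content" format concrete, the sketch below decodes such a string (after
       HTML unescaping) into an options hash. It is not the module's own parsing code; the
       helper name decode_auto_config is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: decodes an application/x-www-form-urlencoded string
# whose pairs are separated by "&" or ";".
sub decode_auto_config {
    my ($content) = @_;
    my %opts;
    for my $pair (split /[&;]/, $content) {
        next unless length $pair;
        my ($key, $value) = split /=/, $pair, 2;
        for ($key, $value) {
            next unless defined;
            tr/+/ /;                               # "+" encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # percent-decoding
        }
        next if $key eq 'auto_config';   # may not be set by the page itself
        $opts{$key} = $value;
    }
    return \%opts;
}

my $opts = decode_auto_config('alt_stylesheet=0;prefix_empty=http://example.net/');
```

       Here $opts holds the two keys alt_stylesheet and prefix_empty from the example above.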

AUTHOR

       Toby Inkster <tobyink@cpan.org>.

COPYRIGHT AND LICENCE

       Copyright (C) 2009-2011 by Toby Inkster

       This library is free software; you can redistribute it and/or modify it under the same
       terms as Perl itself.

DISCLAIMER OF WARRANTIES

       THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING,
       WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
       PURPOSE.
