Ubuntu Manpage: XML::RSSLite - lightweight, "relaxed" RSS (and XML-ish) parser

Provided by: libxml-rsslite-perl_0.17+dfsg-2_all

NAME

       XML::RSSLite - lightweight, "relaxed" RSS (and XML-ish) parser

SYNOPSIS

         use XML::RSSLite;

         parseRSS(\%result, \$content);

         print "=== Channel ===\n",
               "Title: $result{'title'}\n",
               "Desc:  $result{'description'}\n",
               "Link:  $result{'link'}\n\n";

         foreach my $item (@{$result{'items'}}) {
           print "  --- Item ---\n",
                 "  Title: $item->{'title'}\n",
                 "  Desc:  $item->{'description'}\n",
                 "  Link:  $item->{'link'}\n\n";
         }

DESCRIPTION

       This module attempts to extract the maximum amount of content from available documents, and is less
       concerned with XML compliance than alternatives. Rather than rely on XML::Parser, it uses heuristics and
       good old-fashioned Perl regular expressions. It stores the data in a simple hash structure, and "aliases"
       certain tags so that when done, you can count on having the minimal data necessary for re-constructing a
       valid RSS file. This means you get the basic title, description, and link for a channel and its items.

       This module extracts more usable links by parsing "scriptingNews" and "weblog" formats in addition to RDF
       & RSS. It also "sanitizes" the output for best results. The munging includes:

       Remove html tags to leave plain text
       Remove leading whitespace from URIs
       By default, strip all characters except 0-9~!@#$%^&*()-+=a-zA-Z[];',.:"<>?\s
       Use <url> tags when <link> is empty
       Use misplaced urls in <title> when <link> is empty
       Extract links from <a href=...> if required
       Limit links to ftp and http(s)
       Join relative item urls (beginning with / or #) to the site base
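
       As a rough illustration of three of the link-related steps above (trimming leading whitespace, joining
       relative URLs to the site base, and limiting links to ftp and http(s)), the sketch below uses a
       hypothetical helper named tidy_link. It is not part of XML::RSSLite and is not the module's own code;
       the way it joins a relative link to the base is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of XML::RSSLite: approximates three of the
# munging steps described above. The joining rule is an assumption.
sub tidy_link {
    my ($link, $base) = @_;

    $link =~ s/^\s+//;                     # remove leading whitespace

    if ($link =~ m{^[/#]}) {               # relative: begins with / or #
        (my $root = $base) =~ s{/+\z}{};   # drop trailing slashes from the base
        $link = $root . $link;
    }

    # keep only ftp and http(s) links
    return $link =~ m{^(?:ftp|https?)://}i ? $link : undef;
}

print tidy_link("  /news/42", "http://example.com/"), "\n";
# prints: http://example.com/news/42
```

       A link with any other scheme makes this helper return undef, so callers should check the result
       before using it.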

   EXPORT
       parseRSS($outHashRef, $inScalarRef, [$strip])
           inScalarRef - required
               Reference to a scalar containing the document to be parsed. NOTE: The contents will effectively
               be destroyed. Make a deep copy first if you care.

           outHashRef - required
               Reference to the hash within which to store the parsed content.

           strip - optional
               An expression indicating the level of winnowing to be performed on the characters permitted in
               the results.

               1 strip non-printable characters
               0 no characters are removed
               undefined (Default) strip everything but:
                   0-9~!@#$%^&*()-+= a-zA-Z[];',.:"<>?\t\n
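
       A minimal sketch of a call that keeps the original document and passes the optional strip argument.
       The sample feed and the variable names $content and $original are made up for illustration. Because
       the document is a plain string, an ordinary assignment is assumed to be a sufficient copy.

```perl
use strict;
use warnings;
use XML::RSSLite;

# Hypothetical sample feed; any RSS text held in a scalar would do.
my $content = <<'RSS';
<rss version="0.91">
<channel>
<title>Example Channel</title>
<link>http://example.com/</link>
<description>A sample feed</description>
<item>
<title>First item</title>
<link>http://example.com/1</link>
<description>Hello</description>
</item>
</channel>
</rss>
RSS

# parseRSS effectively destroys the input, so keep a copy first.
my $original = $content;

my %result;
parseRSS(\%result, \$content, 1);   # 1: strip non-printable characters only

print "Channel: $result{'title'}\n";
print "Item:    $_->{'title'}\n" for @{ $result{'items'} || [] };
```

       Omitting the third argument selects the default, stricter filtering described above.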

   EXPORTABLE
       parseXML(\%parsedTree, \$parseThis, 'topTag', $comments);
           parsedTree - required
               Reference to hash to store the parsed document within.

           parseThis  - required
               Reference to scalar containing the document to parse.

           topTag     - optional
               Tag to treat as the root node. Leaving this undefined is not recommended.

           comments   - optional
               false will remove comments from parseThis
               true will not remove comments from parseThis
               array reference is true, comments are stored here
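
       Since parseXML is listed as exportable rather than exported, the sketch below assumes it has to be
       requested by name on the use line. The sample document and the names %tree and @comments are
       illustrative. The layout of the resulting hash is not described here, so the example only dumps it
       for inspection.

```perl
use strict;
use warnings;
use Data::Dumper;
use XML::RSSLite qw(parseXML);   # assumed: parseXML must be imported explicitly

# Hypothetical XML-ish document with one comment.
my $doc = '<note><to>Ann</to><!-- draft --><body>Hi</body></note>';

my (%tree, @comments);

# 'note' is the tag to treat as the root node; passing an array reference
# as the last argument counts as true and asks for comments to be stored in it.
parseXML(\%tree, \$doc, 'note', \@comments);

print Dumper(\%tree, \@comments);
```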

   CAVEATS
       This is not a conforming parser. It does not handle the following:

       •   <foo bar=">">

       •   <foo><bar> <bar></bar> <bar></bar> </bar></foo>

       •   <![CDATA[ ]]>

       •   PI (processing instructions)

       It is non-validating. Without a DTD, the following cannot be properly addressed:

       entities
       namespaces
           This may or may not be arriving in some future release.

AUTHOR

       Jerrad Pierce <jpierce@cpan.org>.

       Scott Thomason <scott@thomasons.org>

LICENSE

       Portions Copyright (c) 2002,2003,2009 Jerrad Pierce, (c) 2000 Scott Thomason. All rights reserved. This
       program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

perl v5.36.0                                       2022-11-20                                       RSSLite(3pm)
